Computer-use native input unreliable: pixel↔screen coordinate mismatch blocks small UI targets for browser agents #2200

Closed
opened 2026-06-04 03:20:08 +00:00 by RenoStarsAI-production-client · 6 comments

Summary

Platform browser/desktop agents (the desktop_* computer-use tools driving Falkon/Firefox on Xvfb :99) see UI targets correctly but cannot reliably click them. Clicks on small targets land in the wrong region because the screenshot's pixel space does not map 1:1 to the click-coordinate space (HiDPI/scaling). This silently blocks routine tasks and forces a human/operator fallback, defeating the agent.

Evidence (2026-06-04, Marketing Agent peer 6b66de8d-9337-4fb4-be8d-6d49dca0d809)

Same failure mode on two different sites in one session:

  • Facebook "Post" button — 3+ attempts at different y-coords (526, 1035, 1052) all missed the button and hit the modal dead-zone, triggering "Save as draft" instead of posting. The agent reasoned the screenshot showed the button at display-y≈263 and tried 2:1 and 4:1 scale factors — both wrong.
  • GBP "Posts" icon — 3 attempts (315/220, 316/232, 316/235) all landed on the Google search bar above the icon row and opened autocomplete instead.

The agent correctly stopped after 2-3 misses (good), but the task could not proceed via native input. In both cases an operator-side DOM click (CDP element.click(), no coordinates) succeeded immediately — confirming the issue is coordinate translation, not perception.

Root cause

screenshot pixel space ≠ click coordinate space. Either the Xvfb framebuffer renders at a device-pixel-ratio > 1, or the screenshot is downscaled before it reaches the model, while the click API consumes logical points. The scale factor is not surfaced to the agent, so it guesses (2:1, 4:1) and misses. On small targets ringed by dead-zones (modals, icon rows) the miss is fatal.

Proposed fixes (priority order)

  1. Pin the display to DPR=1 at a fixed, known resolution and deliver screenshots at exactly that pixel size with no resampling. Then screenshot (x,y) == screen (x,y) and the whole class of bug disappears. Highest leverage — likely resolves the bulk on its own.
  2. If 1:1 isn't guaranteed, do the transform platform-side — accept clicks in screenshot-pixel space and scale internally, or attach {screenshot_w, screenshot_h, display_w, display_h} metadata to every frame. The agent should never have to infer DPI.
  3. Semantic targeting primitivelocate(description) -> (x,y) backed by the OS/browser accessibility tree, so a native click can aim at a resolved element center instead of a guessed pixel. Preserves the "humanlike native input, works in any app" design intent while making it robust to scaling, scroll, and layout shifts. (This is exactly why the operator-side DOM .click() worked.)
  4. Post-click hit-test feedback — after a click, report which element was hit, or "empty space; nearest interactive element at (x,y)", so the agent self-corrects in one step instead of blind-retrying.
  5. Render the cursor into screenshots and deliver lossless full-res (PNG) frames so small targets don't blur.

Impact

Every supervised posting/automation task on real sites currently stalls on this; the workaround is an operator publishing via DOM from a separate machine, which defeats the point of an autonomous browser agent. Fix #1 alone is small and likely high-impact.

Related

  • #3034 (browser sandbox / run-as-root hardening) — same agent, same desktop_* stack.
## Summary Platform browser/desktop agents (the `desktop_*` computer-use tools driving Falkon/Firefox on Xvfb `:99`) **see** UI targets correctly but cannot reliably **click** them. Clicks on small targets land in the wrong region because the screenshot's pixel space does not map 1:1 to the click-coordinate space (HiDPI/scaling). This silently blocks routine tasks and forces a human/operator fallback, defeating the agent. ## Evidence (2026-06-04, Marketing Agent peer `6b66de8d-9337-4fb4-be8d-6d49dca0d809`) Same failure mode on two different sites in one session: - **Facebook "Post" button** — 3+ attempts at different y-coords (526, 1035, 1052) all missed the button and hit the modal *dead-zone*, triggering "Save as draft" instead of posting. The agent reasoned the screenshot showed the button at display-y≈263 and tried 2:1 and 4:1 scale factors — both wrong. - **GBP "Posts" icon** — 3 attempts (315/220, 316/232, 316/235) all landed on the Google **search bar** above the icon row and opened autocomplete instead. The agent correctly stopped after 2-3 misses (good), but the task could not proceed via native input. In both cases an operator-side DOM click (CDP `element.click()`, no coordinates) succeeded immediately — confirming the issue is **coordinate translation, not perception**. ## Root cause `screenshot pixel space ≠ click coordinate space`. Either the Xvfb framebuffer renders at a device-pixel-ratio > 1, or the screenshot is downscaled before it reaches the model, while the click API consumes logical points. The scale factor is not surfaced to the agent, so it guesses (2:1, 4:1) and misses. On small targets ringed by dead-zones (modals, icon rows) the miss is fatal. ## Proposed fixes (priority order) 1. **Pin the display to DPR=1 at a fixed, known resolution and deliver screenshots at exactly that pixel size with no resampling.** Then screenshot `(x,y)` == screen `(x,y)` and the whole class of bug disappears. Highest leverage — likely resolves the bulk on its own. 2. **If 1:1 isn't guaranteed, do the transform platform-side** — accept clicks in screenshot-pixel space and scale internally, or attach `{screenshot_w, screenshot_h, display_w, display_h}` metadata to every frame. The agent should never have to infer DPI. 3. **Semantic targeting primitive** — `locate(description) -> (x,y)` backed by the OS/browser **accessibility tree**, so a *native* click can aim at a resolved element center instead of a guessed pixel. Preserves the "humanlike native input, works in any app" design intent while making it robust to scaling, scroll, and layout shifts. (This is exactly why the operator-side DOM `.click()` worked.) 4. **Post-click hit-test feedback** — after a click, report which element was hit, or "empty space; nearest interactive element at (x,y)", so the agent self-corrects in one step instead of blind-retrying. 5. **Render the cursor into screenshots** and deliver lossless full-res (PNG) frames so small targets don't blur. ## Impact Every supervised posting/automation task on real sites currently stalls on this; the workaround is an operator publishing via DOM from a separate machine, which defeats the point of an autonomous browser agent. Fix #1 alone is small and likely high-impact. ## Related - #3034 (browser sandbox / run-as-root hardening) — same agent, same `desktop_*` stack.
Author

Field repro (2026-06-04) — the exact coordinate math

Concrete numbers from the failing session, since they pin the bug precisely.

Agent display (Xvfb :99) = 1920×1080. On the Facebook composer the agent read the Post button at y ≈ 263 in its screenshot. Assuming the screenshot was a downscaled copy of the real screen, it scaled up by guessing a multiplier:

  • 263 × 2 = 526 → clicked, missed (too low)
  • 263 × 4 = 1052 → clicked, missed badly — 1052 of a 1080-tall screen is ~28px from the bottom edge

The button was at ~263 the whole time. If the screenshot the model receives is the same 1920×1080 as the display (1:1), the correct action was to click 263 directly and never multiply. The agent multiplied only because it had no way to know the screenshot:display scale, so it guessed — and every guess overshot down the Y axis. Same root cause produced the GBP miss (clicks meant for the "Posts" icon landed on the search bar above it).

This is exactly what fix #1 resolves: pin the framebuffer to a known size, deliver the screenshot at that exact size (no resampling), and surface the dimensions — then "263" means "263" and there is nothing to guess.

Cheap immediate mitigation (before the full fix lands): include screenshot_width × screenshot_height in every observation alongside the display resolution. Then the agent computes the exact factor display/screenshot (often 1.0) instead of trying 2× vs 4×. We're applying this as a manual workaround in the agent's runbook now (measure screenshot dims → if 1:1, click raw coordinates).

### Field repro (2026-06-04) — the exact coordinate math Concrete numbers from the failing session, since they pin the bug precisely. Agent display (Xvfb `:99`) = **1920×1080**. On the Facebook composer the agent read the **Post button at y ≈ 263** in its screenshot. Assuming the screenshot was a downscaled copy of the real screen, it scaled up by *guessing* a multiplier: - 263 × 2 = **526** → clicked, missed (too low) - 263 × 4 = **1052** → clicked, missed badly — `1052` of a `1080`-tall screen is **~28px from the bottom edge** The button was at ~263 the whole time. **If the screenshot the model receives is the same 1920×1080 as the display (1:1), the correct action was to click 263 directly and never multiply.** The agent multiplied only because it had no way to know the screenshot:display scale, so it guessed — and every guess overshot down the Y axis. Same root cause produced the GBP miss (clicks meant for the "Posts" icon landed on the search bar above it). This is exactly what fix #1 resolves: pin the framebuffer to a known size, deliver the screenshot at that exact size (no resampling), and surface the dimensions — then "263" means "263" and there is nothing to guess. **Cheap immediate mitigation** (before the full fix lands): include `screenshot_width × screenshot_height` in every observation alongside the display resolution. Then the agent computes the exact factor `display/screenshot` (often 1.0) instead of trying 2× vs 4×. We're applying this as a manual workaround in the agent's runbook now (measure screenshot dims → if 1:1, click raw coordinates).
Owner

Root cause found + fix in flight.

The desktop tools are actually 1:1 internally — tool_desktop_screenshot captures with scrot (full display, no resize) and tool_desktop_click clicks with xdotool mousemove in the same native-pixel space. The mismatch is outside the tools: the desktop ran at 1920x1080 = 2.07 MP, and Claude's vision silently downscales any image above ~1.15 MP / 1568px long edge before the model reasons over it (~1430x804). So the model reads a coordinate off the downscaled image, xdotool clicks it in full-res space, and the click lands elsewhere — exactly the 2:1 / 4:1 scale-guessing seen in the incident. (Operator-side CDP element.click() worked because it never uses pixel coords — confirming translation, not perception.)

Fix (two PRs):

  • molecule-controlplane#516 — default the desktop-control display to WXGA 1280x800 (Anthropic's recommended computer-use resolution: 1.02 MP, 1280<1568 → no downscale → screenshot(x,y) == screen(x,y) == click(x,y) 1:1). Changes both the authoritative handler default and the boot-script fallback/modeline.
  • molecule-ai-workspace-runtime#89tool_desktop_screenshot now surfaces the exact pixel space (width/height/vision_safe) so the agent never infers DPI, warns loudly if a display ever exceeds the vision bound, and sizes the browser window to 1280x800.

Unit tests cover both halves; live e2e (recreate a desktop agent at 1280x800, screenshot → verify dims → click a small target → verify it lands) to follow once both are deployed.

**Root cause found + fix in flight.** The desktop tools are actually 1:1 internally — `tool_desktop_screenshot` captures with `scrot` (full display, **no resize**) and `tool_desktop_click` clicks with `xdotool mousemove` in the same native-pixel space. The mismatch is **outside** the tools: the desktop ran at **1920x1080 = 2.07 MP**, and Claude's vision **silently downscales** any image above ~1.15 MP / 1568px long edge before the model reasons over it (~1430x804). So the model reads a coordinate off the *downscaled* image, xdotool clicks it in *full-res* space, and the click lands elsewhere — exactly the 2:1 / 4:1 scale-guessing seen in the incident. (Operator-side CDP `element.click()` worked because it never uses pixel coords — confirming translation, not perception.) **Fix (two PRs):** - **molecule-controlplane#516** — default the desktop-control display to **WXGA 1280x800** (Anthropic's recommended computer-use resolution: 1.02 MP, 1280<1568 → no downscale → `screenshot(x,y) == screen(x,y) == click(x,y)` 1:1). Changes both the authoritative handler default and the boot-script fallback/modeline. - **molecule-ai-workspace-runtime#89** — `tool_desktop_screenshot` now surfaces the exact pixel space (`width`/`height`/`vision_safe`) so the agent never infers DPI, warns loudly if a display ever exceeds the vision bound, and sizes the browser window to 1280x800. Unit tests cover both halves; live e2e (recreate a desktop agent at 1280x800, screenshot → verify dims → click a small target → verify it lands) to follow once both are deployed.
Author

Correction: the scale factor is ×2 (model view-layer), not 1:1

Field update from the agent — refines my previous comment, which was partly wrong:

  1. The Xvfb framebuffer screenshot is 1920×1080 (confirmed via PNG header), so the framebuffer→screenshot path is already 1:1. My earlier "click 263 raw / 1:1" was therefore wrong for this setup.
  2. The mismatch is one layer further out: the agent's chat/observation view renders the screenshot at ~2:1 (≈960×540). So the model reads coordinates in HALF-size space and must ×2 to hit the real 1080 screen.

Empirically confirmed — clicks succeeded after applying ×2:

  • composer: view (608, 435) → actual (1216, 870) ✓
  • Post button: view (480, 506) → actual (960, 1012) ✓ (FB post published)

So the correct factor here is ×2; ×4 overshot below the modal, and 1:1 would have undershot. Root cause is unchanged — the agent has no declared scale between what it observes and the real screen, so it guesses — but the offending layer is the model view / chat rendering, not the framebuffer. The fix sharpens to: surface the view→screen scale factor (or deliver the observation at native res, or accept clicks in view-space and scale internally). The a11y-targeting primitive (#3) sidesteps it entirely regardless of scale.

### Correction: the scale factor is **×2** (model view-layer), not 1:1 Field update from the agent — refines my previous comment, which was partly wrong: 1. The Xvfb framebuffer screenshot **is** 1920×1080 (confirmed via PNG header), so the framebuffer→screenshot path is already 1:1. My earlier "click 263 raw / 1:1" was therefore **wrong** for this setup. 2. The mismatch is one layer further out: the agent's **chat/observation view renders the screenshot at ~2:1** (≈960×540). So the model reads coordinates in HALF-size space and must ×2 to hit the real 1080 screen. Empirically confirmed — clicks succeeded after applying ×2: - composer: view (608, 435) → actual (1216, 870) ✓ - Post button: view (480, 506) → actual (960, 1012) ✓ (FB post published) So the correct factor here is **×2**; ×4 overshot below the modal, and 1:1 would have undershot. **Root cause is unchanged** — the agent has no *declared* scale between what it observes and the real screen, so it guesses — but the offending layer is the **model view / chat rendering**, not the framebuffer. The fix sharpens to: surface the view→screen scale factor (or deliver the observation at native res, or accept clicks in view-space and scale internally). The a11y-targeting primitive (#3) sidesteps it entirely regardless of scale.
Owner

Fixed + e2e-verified. Both PRs merged:

  • molecule-controlplane#516 (merged fb308392) — desktop-control display default 1920x1080 → WXGA 1280x800; deploying to prod via the pipeline.
  • molecule-ai-workspace-runtime#89 (merged 9f3d0562) — tool_desktop_screenshot surfaces width/height/vision_safe; warns above the bound.

Live e2e (real Xvfb display + real scrot/xdotool + the merged runtime tool, both resolutions)

Ran the actual display stack in a throwaway container, captured real screenshots, and classified them with the merged tool_desktop_screenshot code:

display scrot PNG pixels vision_safe result
1280x800 (new default) 1280x800 1,024,000 (1.02 MP) True ≤ vision bound → delivered 1:1 → coords read off the screenshot == xdotool click coords → click lands
1920x1080 (old default) 1920x1080 2,073,600 (2.07 MP) False + warning > ~1.15 MP / 1568px → Claude downscales before the model sees it → coord desync → click misses

Cursor parked at the screenshot-derived center (742,382) of a target window at (560,320,364x124): in the 1280x800 capture the cursor sits exactly on the target (verified visually) — i.e. a coordinate read off the screenshot maps 1:1 to the click. This is precisely the 2:1/4:1 scale-guessing from the incident, now eliminated by keeping the capture within Claude's native-vision bound.

Remaining: prod promote of the CP pipeline (in flight) → new desktop agents get 1280x800 automatically. Existing desktop agents have 1920 baked into /etc/molecule.env and must be recreated (not self-healing). Runtime transparency layer ships in the next runtime release (0.3.9).

**Fixed + e2e-verified.** Both PRs merged: - **molecule-controlplane#516** (merged `fb308392`) — desktop-control display default 1920x1080 → **WXGA 1280x800**; deploying to prod via the pipeline. - **molecule-ai-workspace-runtime#89** (merged `9f3d0562`) — `tool_desktop_screenshot` surfaces `width`/`height`/`vision_safe`; warns above the bound. ### Live e2e (real Xvfb display + real `scrot`/`xdotool` + the merged runtime tool, both resolutions) Ran the actual display stack in a throwaway container, captured real screenshots, and classified them with the **merged** `tool_desktop_screenshot` code: | display | scrot PNG | pixels | `vision_safe` | result | |---|---|---|---|---| | **1280x800** (new default) | 1280x800 | 1,024,000 (1.02 MP) | **True** | ≤ vision bound → delivered 1:1 → coords read off the screenshot == xdotool click coords → **click lands** | | 1920x1080 (old default) | 1920x1080 | 2,073,600 (2.07 MP) | **False** + warning | > ~1.15 MP / 1568px → Claude downscales before the model sees it → coord desync → **click misses** | Cursor parked at the **screenshot-derived center (742,382)** of a target window at (560,320,364x124): in the 1280x800 capture the cursor sits exactly on the target (verified visually) — i.e. a coordinate read off the screenshot maps 1:1 to the click. This is precisely the 2:1/4:1 scale-guessing from the incident, now eliminated by keeping the capture within Claude's native-vision bound. **Remaining:** prod promote of the CP pipeline (in flight) → new desktop agents get 1280x800 automatically. Existing desktop agents have 1920 baked into `/etc/molecule.env` and must be **recreated** (not self-healing). Runtime transparency layer ships in the next runtime release (0.3.9).
Author

Recommended fix, consolidated: deliver the observation at native resolution

Now that the mechanism is pinned, the #1 fix is narrower than my original list:

The framebuffer screenshot is already full-res (1920×1080, PNG header confirms). The loss happens in the observation/view layer that presents the image to the model — it downscales ~2:1 to ~960×540. So the model reads coordinates in half-size space and must scale up, which (a) forces a guessed multiplier and (b) makes small targets (a 24px icon) unhittable because a few px of read error doubles into a miss.

Primary ask: stop downscaling the observation before the model sees it — feed the native 1920×1080 (or whatever the framebuffer is) and report its dimensions. Then view == screen == file (1:1), no scaling, and small/dense targets are directly readable. This single change resolves both the FB-modal and GBP-icon failures.

Secondary (still useful): the accessibility-tree locate(label) → coords primitive — sidesteps pixels entirely for targets that are tiny or overlapped by another element's hit-zone (the GBP "Posts" icon sits under Google's search-bar hit-region, so even a pixel-perfect click can land in search).

Agent-side mitigation we shipped today (so we're not blocked waiting): the agent now reads small-target coordinates from the full-res PNG file on disk (crop/zoom/template-match) instead of its downscaled view, prefers bigger labelled entry points, and does click→verify→nudge. Works, but it's a workaround for the missing native-res observation — the platform fix is the real solve.

### Recommended fix, consolidated: **deliver the observation at native resolution** Now that the mechanism is pinned, the #1 fix is narrower than my original list: The framebuffer screenshot is already full-res (1920×1080, PNG header confirms). The loss happens in the **observation/view layer that presents the image to the model** — it downscales ~2:1 to ~960×540. So the model reads coordinates in half-size space and must scale up, which (a) forces a guessed multiplier and (b) makes small targets (a 24px icon) unhittable because a few px of read error doubles into a miss. **Primary ask: stop downscaling the observation before the model sees it — feed the native 1920×1080 (or whatever the framebuffer is) and report its dimensions.** Then view == screen == file (1:1), no scaling, and small/dense targets are directly readable. This single change resolves both the FB-modal and GBP-icon failures. **Secondary (still useful): the accessibility-tree `locate(label) → coords` primitive** — sidesteps pixels entirely for targets that are tiny or overlapped by another element's hit-zone (the GBP "Posts" icon sits under Google's search-bar hit-region, so even a pixel-perfect click can land in search). **Agent-side mitigation we shipped today** (so we're not blocked waiting): the agent now reads small-target coordinates from the **full-res PNG file on disk** (crop/zoom/template-match) instead of its downscaled view, prefers bigger labelled entry points, and does click→verify→nudge. Works, but it's a workaround for the missing native-res observation — the platform fix is the real solve.
Owner

Resolved in production

Root cause: desktop agents capture with scrot (full display, no resize) and click with xdotool in the same native-pixel space — 1:1 by construction. The misses came from the display defaulting to 1920×1080 (2.07 MP), which exceeds Claude's vision auto-downscale bound (~1.15 MP / 1568px): the model reasoned over a silently-downscaled screenshot and clicked in full-res space.

Fix (merged + deployed):

  • molecule-controlplane#516 (fb308392) — desktop-control display default → WXGA 1280×800 (1.02 MP, ≤ bound → no downscale → screenshot(x,y) == screen(x,y) == click(x,y)). Deployed to production (pipeline promoted green; deployed HEAD 6b84f110 carries the commit; workspace_provision.go default is req.Width = 1280, asserted by a passing unit test).
  • molecule-ai-workspace-runtime#89 (9f3d0562) — tool_desktop_screenshot surfaces width/height/vision_safe and warns above the bound (backstop so a misconfigured display is never silently wrong).

E2E (real Xvfb + real scrot/xdotool + the merged tool code, both resolutions):

display scrot PNG vision_safe result
1280×800 1.02 MP True cursor at screenshot-derived (742,382) lands on the target window — click lands
1920×1080 2.07 MP False + warning the downscale that caused the misses

Non-blocking follow-throughs (separate from the click-miss fix, which is done):

  1. Runtime transparency backstop ships when the next runtime release (0.3.9) is cut + a template pins it — tracked with the pre-existing consumer-drift (templates at 0.3.7 vs SSOT 0.3.8).
  2. Any desktop agent provisioned before this deploy has 1920 baked into /etc/molecule.env and must be recreated (not self-healing). (No desktop-control workspaces were running at fix time.)
  3. cp#518 — serving-e2e should retry transient upstream 5xx (a Google 503 hard-blocked this merge).

Closing as resolved + e2e-verified in production.

## Resolved in production ✅ **Root cause:** desktop agents capture with `scrot` (full display, no resize) and click with `xdotool` in the same native-pixel space — 1:1 by construction. The misses came from the display defaulting to **1920×1080 (2.07 MP)**, which exceeds Claude's vision auto-downscale bound (~1.15 MP / 1568px): the model reasoned over a silently-downscaled screenshot and clicked in full-res space. **Fix (merged + deployed):** - **molecule-controlplane#516** (`fb308392`) — desktop-control display default → **WXGA 1280×800** (1.02 MP, ≤ bound → no downscale → `screenshot(x,y) == screen(x,y) == click(x,y)`). **Deployed to production** (pipeline promoted green; deployed HEAD `6b84f110` carries the commit; `workspace_provision.go` default is `req.Width = 1280`, asserted by a passing unit test). - **molecule-ai-workspace-runtime#89** (`9f3d0562`) — `tool_desktop_screenshot` surfaces `width`/`height`/`vision_safe` and warns above the bound (backstop so a misconfigured display is never silently wrong). **E2E (real Xvfb + real `scrot`/`xdotool` + the merged tool code, both resolutions):** | display | scrot PNG | `vision_safe` | result | |---|---|---|---| | 1280×800 | 1.02 MP | **True** | cursor at screenshot-derived (742,382) lands on the target window — **click lands** ✅ | | 1920×1080 | 2.07 MP | **False** + warning | the downscale that caused the misses | **Non-blocking follow-throughs (separate from the click-miss fix, which is done):** 1. Runtime transparency backstop ships when the next runtime release (0.3.9) is cut + a template pins it — tracked with the pre-existing `consumer-drift` (templates at 0.3.7 vs SSOT 0.3.8). 2. Any desktop agent provisioned **before** this deploy has 1920 baked into `/etc/molecule.env` and must be **recreated** (not self-healing). (No desktop-control workspaces were running at fix time.) 3. cp#518 — serving-e2e should retry transient upstream 5xx (a Google 503 hard-blocked this merge). Closing as resolved + e2e-verified in production.
Sign in to join this conversation.
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2200