This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal: The `computer_use_remote` tool now offers cross-platform desktop control with mandatory visual verification and returns screenshots as multimodal vision messages.
Why it matters: Every state-changing desktop action must be confirmed by a fresh screenshot before the agent proceeds, so automation on macOS, Windows, and Linux is less likely to continue past a step that silently failed.
Summary
AI summary: `computer_use_remote` now provides full cross-platform desktop control with mandatory visual verification and multimodal screenshot output.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Medium | Exposes `computer_use_remote` as a callable tool in live sessions. (Source: llm_adapter@2026-05-23; confidence: high) | None |
| Feature | Medium | Requires visual verification via fresh screenshot for all state-changing desktop actions. (Source: llm_adapter@2026-05-23; confidence: high) | None |
| Feature | Medium | Returns screenshots as multimodal vision messages instead of text summaries. (Source: llm_adapter@2026-05-23; confidence: high) | None |
| Feature | Medium | Separates host desktop control (`computer_use_remote`) from Xpra Docker environment. (Source: llm_adapter@2026-05-23; confidence: high) | None |
| Feature | Medium | Adds platform-specific structural targeting skills for macOS (AX), Windows (UIA), and Linux (AT-SPI/Wayland). (Source: llm_adapter@2026-05-23; confidence: high) | None |
| Feature | Medium | Updates window-hide guidance to prefer `Super+H` over `Alt+F9` on Ubuntu/GNOME/Wayland. (Source: llm_adapter@2026-05-23; confidence: low) | None |
| Feature | Medium | Prunes older capture payloads to prevent runaway context growth. (Source: llm_adapter@2026-05-23; confidence: low) | None |
| Bugfix | Medium | Preserves vision inputs in Codex OAuth proxy by converting image parts to `input_image` API fields. (Source: llm_adapter@2026-05-23; confidence: high) | None |
| Bugfix | Medium | Handles macOS approval denial by mapping `COMPUTER_USE_APPROVAL_REQUIRED` to re-arm-required stop flow. (Source: llm_adapter@2026-05-23; confidence: high) | None |
| Bugfix | Medium | Sanitizes base64 image data URLs from token estimates to prevent screenshot-induced token inflation. (Source: llm_adapter@2026-05-23; confidence: high) | None |
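The last bugfix in the table can be illustrated with a small sketch. This is not the project's code: the regular expression, the `[image]` placeholder, the function names, and the four-characters-per-token heuristic are all assumptions chosen for illustration. It only shows the general idea of stripping base64 image data URLs from text before estimating its token cost.

```python
import re

# Assumed pattern: matches base64 image data URLs such as
# "data:image/png;base64,iVBORw0KGgo...". Not taken from the project.
DATA_URL_RE = re.compile(r"data:image/[a-zA-Z0-9.+-]+;base64,[A-Za-z0-9+/=]+")

# Hypothetical short stand-in that is counted instead of the payload.
PLACEHOLDER = "[image]"


def sanitize_for_estimate(text: str) -> str:
    """Replace embedded base64 image data URLs with a short placeholder."""
    return DATA_URL_RE.sub(PLACEHOLDER, text)


def estimate_tokens(text: str) -> int:
    """Rough estimate of about 4 characters per token.

    The heuristic is an assumption for this sketch, not the real tokenizer.
    """
    return max(1, len(sanitize_for_estimate(text)) // 4)
```

With a sanitizer like this, a message that embeds a multi-megabyte screenshot is budgeted as a few tokens of text rather than as hundreds of thousands of characters of base64.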
Full changelog
Computer Use Remote: Full Desktop Control Pipeline
This release builds out the `computer_use_remote` tool into a complete, cross-platform host desktop control system with visual verification, multimodal capture handling, and platform-specific structural targeting.
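As a rough illustration of the multimodal capture handling, the sketch below converts chat-style content parts into Responses-style input parts, which is the kind of translation the Codex OAuth proxy fix in this release concerns. The part shapes (`image_url` on the chat side, `input_text` and `input_image` on the Responses side) follow OpenAI's published formats as far as I know them; the function names are hypothetical and the proxy's actual implementation is not shown in these notes.

```python
from typing import Any, Dict, List, Union

Part = Dict[str, Any]


def to_responses_part(part: Part) -> Part:
    """Convert one chat-style content part into a Responses-style input part."""
    kind = part.get("type")
    if kind == "text":
        return {"type": "input_text", "text": part["text"]}
    if kind == "image_url":
        image = part["image_url"]
        # Chat-style parts carry either {"url": ...} or a bare string.
        url = image["url"] if isinstance(image, dict) else image
        # Keep the image as an image part instead of flattening it to text.
        return {"type": "input_image", "image_url": url}
    # Unknown part types are passed through unchanged in this sketch.
    return part


def convert_content(content: Union[str, List[Part]]) -> List[Part]:
    """Normalize plain strings and part lists into Responses-style parts."""
    if isinstance(content, str):
        return [{"type": "input_text", "text": content}]
    return [to_responses_part(p) for p in content]
```

The point of the sketch is the `image_url` branch: a screenshot attached to a tool result stays a vision input after translation, so the model can still inspect it.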
Highlights
- `computer_use_remote` exposed as a callable tool: The model can now invoke `computer_use_remote` directly in live sessions. Availability, trust mode, and re-arm enforcement remain runtime checks rather than prompt-loader gates.
- Visual verification required for all desktop actions: State-changing desktop actions are treated as unverified attempts until a fresh screenshot visibly confirms the outcome. Agents must stop and not proceed when a screenshot is unavailable.
- Screenshots attached as multimodal tool results: Computer-use captures are returned as real multimodal vision messages (not just text summaries), so the model can visually inspect the screen after each action. Older capture payloads are pruned to prevent runaway context growth.
- Host desktop cleanly separated from Xpra desktop: `computer_use_remote` is now the sole host desktop-control path; `linux-desktop` targets only the internal Docker/Xpra environment. Host-screen queries rank ahead of the Xpra skill while explicit "Agent Zero Desktop" requests still route correctly.
- Platform-specific structural targeting skills:
  - macOS: Dedicated skill for Accessibility (AX) structural targeting with `ax_snapshot` and `ax_action` support, loaded only when the backend reports macOS capabilities.
  - Windows: UIA-based skill with window-management guidance, selector passthrough, and click-last workflow hints.
  - Linux: AT-SPI/Wayland skill with compact structural tree outlines in snapshot responses for semantic target selection.
  - Backend-specific action details are kept out of the generic prompt; the generic layer handles only backend discovery and skill loading.
- Codex OAuth proxy preserves vision inputs: Image content parts are now correctly converted to Responses API `input_image` parts instead of being flattened to text, with regression coverage for multimodal tool results containing screenshots.
- macOS approval denial handled gracefully: `COMPUTER_USE_APPROVAL_REQUIRED` responses map to the existing re-arm-required stop flow, preventing agents from retrying or falling back to screenshots when permissions haven't been granted.
- Prompt token accounting fixed for screenshots: Embedded base64 image data URLs are sanitized from token estimates so screenshot attachments no longer inflate context budgets.
- Window-hide guidance updated: Ubuntu/GNOME/Wayland sessions now prefer `Super+H` over `Alt+F9`, with a reminder that keystroke results only prove the keys were sent, not that the action succeeded.
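The verification rule that runs through these highlights (an action is only an attempt until a fresh screenshot confirms it, and the agent stops when no screenshot is available) can be sketched as a small gate. Everything here is hypothetical: `ActionResult`, `run_verified_action`, and the three callables are illustrative stand-ins, not the tool's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ActionResult:
    status: str  # "verified", "unverified", or "stopped"
    detail: str


def run_verified_action(
    perform: Callable[[], None],
    capture: Callable[[], Optional[bytes]],
    confirms: Callable[[bytes], bool],
) -> ActionResult:
    """Run one state-changing action and gate progress on a fresh screenshot.

    perform, capture, and confirms are hypothetical callables supplied by
    the caller; they stand in for the real backend in this sketch.
    """
    perform()  # Sending input only proves the input was sent.
    screenshot = capture()  # Taken after the action, never reused from before.
    if screenshot is None:
        # No fresh screenshot: stop instead of assuming success.
        return ActionResult("stopped", "no fresh screenshot available")
    if confirms(screenshot):
        return ActionResult("verified", "screenshot confirms the outcome")
    return ActionResult("unverified", "screenshot does not show the outcome")
```

A caller would continue to the next step only on `"verified"`, treating both other statuses as reasons to halt or re-plan.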
Earlier breaking changes
- v1.16: Legacy speech settings and APIs removed; use `_kokoro_tts` and `_whisper_stt` plugins instead.
- v1.14: Multi-action tools standardized around `tool_args.action` with backward compatibility.
- v1.14: A0 connector remote workflow split into separate text-editor and code-execution skills.
- v1.14: Office skills renamed to task-oriented names: Writer, Calc, Impress.