A remote code execution story about modern “Computer Use” agents, Microsoft’s OmniParser/OmniTool, and what happens when capability meets reachability.

“In agent stacks, every HTTP port that can do things is a pair of hands. Make sure they’re yours.”

Video from Microsoft

TL;DR

While mapping Microsoft’s OmniParser/OmniTool, I followed the path from prompt → parsing → action and found a reachable, unauthenticated execution surface on the VM controller. If the service is network-accessible, that path becomes remote code execution (RCE): remote control by design. An attacker can send commands directly to the controller and drive the GUI agent running on the computer. Microsoft acknowledged the issue (MSRC Case 97706), shipped a fix, and assigned CVE-2025-55322 (https://msrc.microsoft.com/update-guide/en-US/vulnerability/CVE-2025-55322). Upgrade and harden now.

What is OmniParser (and why it’s important)

OmniParser converts screenshots into labeled UI elements so LLMs can “ground” actions on pixels - icons, text, and their semantics. Microsoft’s V2 post frames it explicitly as turning any LLM into a Computer Use agent (Microsoft Research). The project site collects examples and benchmarks, and the code lives in Microsoft’s repo (Project page; GitHub). On GitHub alone, the repo shows ~23.6k stars and ~2k forks, a sign of serious adoption and active maintenance.

Figure 1: OmniParser, a GUI agent automating your PC

Coverage highlighted OmniParser’s rise to the top of open-source charts and its role in the industry’s sprint toward agentic “screen interaction” (VentureBeat). Meanwhile, a broader wave is pushing “Computer Use” from concept to product - see Anthropic’s public beta and Microsoft Azure’s docs for production deployment guidance (Anthropic announcement, Azure OpenAI Computer Use docs).

Figure 2: OmniParser’s rise and role in Computer Use

From “click” to “command”

I reproduced the same three-box pipeline you see in guides and the repo: a UI layer, OmniParser for screen parsing, and a VM controller that carries out actions (OmniTool readme; community how-to). The controller exposes an HTTP interface to execute operations on the VM - that’s how agents can actually click, type, and launch apps.

Figure 3: Overall components: UI → Parser → VM Controller

Then came the turn: if that execution interface is reachable and doesn’t require identity, the same mechanism becomes an RCE surface. No heap tricks; just a powerful, legitimate control plane left open to whoever can talk to it.
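The reachability half of that condition often comes down to which address a service binds to. Here is a minimal sketch of a pre-flight check that classifies a listen address before a controller starts. The function name `bind_scope` and its labels are hypothetical and are not taken from the OmniTool code; the sketch only illustrates the difference between a loopback bind and an all-interfaces bind.

```python
import ipaddress


def bind_scope(host: str) -> str:
    """Classify a listen address (hypothetical helper, for illustration).

    Returns "loopback" when only local processes can connect,
    "all-interfaces" when the service listens on every interface,
    and "specific-interface" for any other literal IP address.
    Hostnames other than "localhost" raise ValueError.
    """
    if host == "localhost":
        return "loopback"
    if host == "":
        # Many servers treat an empty host as "bind everywhere".
        return "all-interfaces"
    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:  # 0.0.0.0 or ::
        return "all-interfaces"
    if addr.is_loopback:  # 127.0.0.0/8 or ::1
        return "loopback"
    return "specific-interface"


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "0.0.0.0", "::1", "192.168.1.10"):
        print(candidate, "->", bind_scope(candidate))
```

A startup script could refuse to launch an execute-capable service unless the result is "loopback" or authentication is configured. A loopback bind only removes direct exposure: port forwarding or a reverse proxy can still make the service reachable.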

Figure 4: The flawed code, which opens the service to the internet without access control

Figure 5: Complete attack flow

Why this is personal: “running an assistant” can mean “exposing your computer”

If you run this as a daily assistant on your own laptop or desktop - for example to automate email, spreadsheets, and browser tasks - and the VM controller becomes reachable (e.g., LAN exposure, port forwarding, permissive firewall, misconfigured reverse proxy), an attacker who can reach those ports can drive your computer:

  • Immediate remote code execution: the execution endpoint lets them run arbitrary commands with your privileges.
  • Data exposure: they can read files, tokens, screenshots, and parse artifacts, then exfiltrate.
  • Persistence: one command can drop an SSH key or scheduled task.
  • Operational cover: if they route actions through the UI → Parser → Controller path, logs can resemble “ordinary automation.”

This isn’t theoretical; it’s the same pattern showing up across the AI infra wave: capability + reachability = risk.

Echoes across the ecosystem

Different bugs; shared moral: in agent systems, control planes are production surfaces.

What changed upstream

Microsoft shipped OmniParser v2.0.1 with “Security Updates.” If you cloned older examples or mirrored configs, upgrade and audit your runtime wiring before re-exposing anything (release notes).


Operator’s checklist

  • Constrain reachability: keep parser/VM control non-public; prefer container/VM-internal networking; if exposure is required, put it behind a hardened reverse proxy.
  • Require identity: enforce auth (tokens/mTLS) for any control/execute API; disable debug; add rate limits.
  • Constrain capability: replace “arbitrary command” surfaces with a high-level DSL (click, type, open, wait); validate and sanitize parameters.
  • Log for intent: record caller, endpoint, and action class; alert on unusual sequences (rapid executes, file ops).
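Two of the items above, "require identity" and "constrain capability", can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the fix Microsoft shipped: the action names come from the checklist, while the parameter schemas, the function names `is_authorized` and `validate_action`, and the request shape are hypothetical.

```python
import hmac

# Hypothetical allowlist: action name -> required parameters and their types.
ALLOWED_ACTIONS = {
    "click": {"x": int, "y": int},
    "type": {"text": str},
    "open": {"app": str},
    "wait": {"seconds": int},
}


def is_authorized(presented_token: str, expected_token: str) -> bool:
    """Compare tokens in constant time; an unset expected token denies all."""
    if not expected_token:
        return False
    return hmac.compare_digest(
        presented_token.encode("utf-8"), expected_token.encode("utf-8")
    )


def validate_action(request: dict) -> dict:
    """Accept only allowlisted actions with exactly the expected parameters.

    Raises ValueError for anything else, so an "arbitrary command"
    request never reaches an executor.
    """
    name = request.get("action")
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {name!r}")
    schema = ALLOWED_ACTIONS[name]
    params = request.get("params")
    if not isinstance(params, dict) or set(params) != set(schema):
        raise ValueError(f"{name}: expected parameters {sorted(schema)}")
    for key, expected_type in schema.items():
        # type() is used instead of isinstance() so that bool is not
        # accepted where an int is required.
        if type(params[key]) is not expected_type:
            raise ValueError(f"{name}.{key}: expected {expected_type.__name__}")
    return {"action": name, "params": dict(params)}


if __name__ == "__main__":
    print(validate_action({"action": "click", "params": {"x": 10, "y": 20}}))
```

The design choice is to deny by default: a request is rejected unless the caller presents the right token and the action matches a known shape. Type checks alone are not sanitization; a value such as the `app` name would still need its own allowlist before anything is launched.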

Coordinated disclosure timeline (MSRC Case 97706)

  • May 12, 2025 (10:23 PT) - Report filed to Microsoft: reachable, unauthenticated execution path on the VM controller and Gradio; multiple agent components accessible over HTTP.
  • May 13, 2025 (14:55 PT) - MSRC opens Case 97706; triage begins.
  • May 29, 2025 - Full PoC and ecosystem analogs shared.
  • Aug 21–Sep 4, 2025 - MSRC: fix in progress; coordination on disclosure timing.
  • Sep 10, 2025 - MSRC confirms a fix has shipped and prepares researcher acknowledgement.
  • Sep 12, 2025 - OmniParser v2.0.1 release with “Security Updates” (notes).
  • Sep 24, 2025 - MSRC assigns CVE-2025-55322.

Similar issues reported by other users on the Gradio component: https://github.com/microsoft/OmniParser/issues/233

Final thought: the road for “Computer Use”

The market is racing toward agents that see and act. OmniParser/OmniTool helped catalyze that momentum - and the project’s popularity shows builders are already here. But because these agents can act, their control planes must evolve from convenience endpoints into production-grade surfaces with identity, isolation, and guardrails baked in.

“Computer Use means your model doesn’t just say - it does. From now on, every ‘do’ must start with who.”

- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Sources & further reading