OmniParser V2 – A simple screen parsing tool towards pure vision based GUI agent

  • The OS has additional information including how different graphics layers are composited, and what accessibility metadata is attached to interface elements. It ought to be useful to exploit this to do better than screenshot parsing.

  • This is not the intended use, but it works well for parsing document layout from an image.

  • One ponders the connections with the Recall feature.

  • Very cool work. Accurate GUI text and element parsing is exactly the kind of input that LLMs need to be effective agents.