ScreenSpot

Natural Language Visual Grounding Benchmark

Showing 18 results. Metric: Accuracy (%)
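This page does not spell out how the accuracy figure is scored. A minimal sketch, assuming the click-accuracy convention often used for GUI grounding (a prediction counts as correct when the predicted point falls inside the target element's bounding box), might look like the following. The function names and the `(left, top, right, bottom)` box layout are hypothetical, chosen only for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed layout: (left, top, right, bottom)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Percentage of predicted points that land inside their target box.

    Assumption: one predicted point and one ground-truth box per sample.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    preds = [(5.0, 5.0), (50.0, 50.0)]
    boxes = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    print(f"{click_accuracy(preds, boxes):.2f}")  # one hit out of two samples
```

Individual papers may differ in coordinate normalization or in how they aggregate across splits, so this sketch should not be read as the exact protocol behind any row in the tables.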

Top Performing Models

Rank	Model	Paper	Accuracy (%)	Date	Code
1	UGround-V1-7B	Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	86.34	2024-10-07	OSU-NLP-Group/UGround
2	Aguvis-7B	Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction	83.00	2024-12-05	xlang-ai/aguvis
3	OS-Atlas-Base-7B	OS-ATLAS: A Foundation Action Model for Generalist GUI Agents	82.47	2024-10-30	njucckevin/seeclick, OS-Copilot/OS-Atlas
4	Aria-UI	Aria-UI: Visual Grounding for GUI Instructions	81.10	2024-12-20	ariaui/aria-ui
5	Aguvis-G-7B	Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction	81.00	2024-12-05	xlang-ai/aguvis
6	UGround-V1-2B	Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	77.67	2024-10-07	OSU-NLP-Group/UGround
7	ShowUI	ShowUI: One Vision-Language-Action Model for GUI Visual Agent	75.10	2024-11-26	showlab/showui
8	ShowUI-G	ShowUI: One Vision-Language-Action Model for GUI Visual Agent	75.00	2024-11-26	showlab/showui
9	UGround	Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	73.30	2024-10-07	OSU-NLP-Group/UGround
10	OmniParser	OmniParser for Pure Vision Based GUI Agent	73.00	2024-08-01	microsoft/omniparser

All Papers (18)

Paper	Year	Model	Code
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	2024	UGround-V1-7B	OSU-NLP-Group/UGround
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction	2024	Aguvis-7B	xlang-ai/aguvis
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents	2024	OS-Atlas-Base-7B	njucckevin/seeclick, OS-Copilot/OS-Atlas
Aria-UI: Visual Grounding for GUI Instructions	2024	Aria-UI	ariaui/aria-ui
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction	2024	Aguvis-G-7B	xlang-ai/aguvis
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	2024	UGround-V1-2B	OSU-NLP-Group/UGround
ShowUI: One Vision-Language-Action Model for GUI Visual Agent	2024	ShowUI	showlab/showui
ShowUI: One Vision-Language-Action Model for GUI Visual Agent	2024	ShowUI-G	showlab/showui
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	2024	UGround	OSU-NLP-Group/UGround
OmniParser for Pure Vision Based GUI Agent	2024	OmniParser	microsoft/omniparser
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents	2024	OS-Atlas-Base-4B	njucckevin/seeclick, OS-Copilot/OS-Atlas
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents	2024	SeeClick	njucckevin/seeclick
CogAgent: A Visual Language Model for GUI Agents	2023	CogAgent	thudm/cogvlm, THUDM/CogAgent, digirl-agent/digirl
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024	Qwen2-VL-7B	qwenlm/qwen2-vl, qwenlm/qwen2.5-vl
GUICourse: From General Vision Language Models to Versatile GUI Agents	2024	Qwen-GUI	yiye3/guicourse
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning	2023	MiniGPT-v2	vision-cair/minigpt-4, zebangcheng/emotion-llama
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models	2024	Groma	FoundationVision/Groma
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023	Qwen-VL	qwenlm/qwen-vl, brandon3964/multimodal-task-vector

Model	Paper	Accuracy (%)	Date
UGround-V1-7B	Navigating the Digital World as Humans Do: Univer…	86.34	2024-10-07
Aguvis-7B	Aguvis: Unified Pure Vision Agents for Autonomous…	83.00	2024-12-05
OS-Atlas-Base-7B	OS-ATLAS: A Foundation Action Model for Generalis…	82.47	2024-10-30
Aria-UI	Aria-UI: Visual Grounding for GUI Instructions	81.10	2024-12-20
Aguvis-G-7B	Aguvis: Unified Pure Vision Agents for Autonomous…	81.00	2024-12-05
UGround-V1-2B	Navigating the Digital World as Humans Do: Univer…	77.67	2024-10-07
ShowUI	ShowUI: One Vision-Language-Action Model for GUI …	75.10	2024-11-26
ShowUI-G	ShowUI: One Vision-Language-Action Model for GUI …	75.00	2024-11-26
UGround	Navigating the Digital World as Humans Do: Univer…	73.30	2024-10-07
OmniParser	OmniParser for Pure Vision Based GUI Agent	73.00	2024-08-01
OS-Atlas-Base-4B	OS-ATLAS: A Foundation Action Model for Generalis…	68.00	2024-10-30
SeeClick	SeeClick: Harnessing GUI Grounding for Advanced V…	53.40	2024-01-17
CogAgent	CogAgent: A Visual Language Model for GUI Agents	47.40	2023-12-14
Qwen2-VL-7B	Qwen2-VL: Enhancing Vision-Language Model's Perce…	42.10	2024-09-18
Qwen-GUI	GUICourse: From General Vision Language Models to…	28.60	2024-06-17
MiniGPT-v2	MiniGPT-v2: large language model as a unified int…	5.70	2023-10-14
Groma	Groma: Localized Visual Tokenization for Groundin…	5.20	2024-04-19
Qwen-VL	Qwen-VL: A Versatile Vision-Language Model for Un…	5.20	2023-08-24
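The rows above can also be read chronologically. As a rough sketch, the snippet below copies the model, accuracy, and date columns from the table and lists the entries that set a new best accuracy when sorted by date. The names `RESULTS` and `running_best` are hypothetical helpers introduced only for this illustration.

```python
# (model, accuracy in %, date) copied from the results table above.
RESULTS = [
    ("UGround-V1-7B", 86.34, "2024-10-07"),
    ("Aguvis-7B", 83.00, "2024-12-05"),
    ("OS-Atlas-Base-7B", 82.47, "2024-10-30"),
    ("Aria-UI", 81.10, "2024-12-20"),
    ("Aguvis-G-7B", 81.00, "2024-12-05"),
    ("UGround-V1-2B", 77.67, "2024-10-07"),
    ("ShowUI", 75.10, "2024-11-26"),
    ("ShowUI-G", 75.00, "2024-11-26"),
    ("UGround", 73.30, "2024-10-07"),
    ("OmniParser", 73.00, "2024-08-01"),
    ("OS-Atlas-Base-4B", 68.00, "2024-10-30"),
    ("SeeClick", 53.40, "2024-01-17"),
    ("CogAgent", 47.40, "2023-12-14"),
    ("Qwen2-VL-7B", 42.10, "2024-09-18"),
    ("Qwen-GUI", 28.60, "2024-06-17"),
    ("MiniGPT-v2", 5.70, "2023-10-14"),
    ("Groma", 5.20, "2024-04-19"),
    ("Qwen-VL", 5.20, "2023-08-24"),
]


def running_best(results):
    """Return (date, model, accuracy) for each entry that sets a new best.

    ISO dates sort correctly as strings; ties on date are broken by
    putting the higher accuracy first.
    """
    ordered = sorted(results, key=lambda r: (r[2], -r[1]))
    best = float("-inf")
    records = []
    for model, accuracy, date in ordered:
        if accuracy > best:
            best = accuracy
            records.append((date, model, accuracy))
    return records


if __name__ == "__main__":
    for date, model, accuracy in running_best(RESULTS):
        print(f"{date}  {model:<14} {accuracy:.2f}")
```

The dates are the ones listed in the table, so the ordering reflects those dates rather than when each score was actually reported on ScreenSpot.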