ScreenSpot

Dataset Information

Modalities: Images, Texts
Languages: English
Introduced: 2024

Overview

ScreenSpot Evaluation Benchmark

ScreenSpot is an evaluation benchmark for GUI grounding. It comprises over 1,200 instructions collected from iOS, Android, macOS, Windows, and Web environments. Each data point is annotated with the type of its target element (Text or Icon/Widget). For more details and examples, please refer to our paper.

Test Sample Details

Each test sample includes:

  • img_filename: The interface screenshot file.
  • instruction: Human-provided instruction.
  • bbox: The bounding box of the target element corresponding to the instruction.
  • data_type: The type of the target element, either "icon" or "text".
  • data_source: The interface platform, which could be iOS, Android, macOS, Windows, or Web (e.g., GitLab, Shop, Forum, Tool).
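The fields above are enough to score a grounding model that outputs a click point. The following is a minimal sketch, not the benchmark's official evaluation script. It assumes that `bbox` is stored as (x, y, width, height) in pixels and that a prediction counts as correct when the click falls inside the box. The `Sample` class and the helpers `is_hit` and `accuracy_by_type` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    img_filename: str   # interface screenshot file
    instruction: str    # human-provided instruction
    bbox: tuple[float, float, float, float]  # ASSUMPTION: (x, y, width, height) in pixels
    data_type: str      # "icon" or "text"
    data_source: str    # interface platform, e.g. "android" or "gitlab"


def is_hit(sample: Sample, click: tuple[float, float]) -> bool:
    """Return True if the predicted click point lies inside the target bbox."""
    x, y, w, h = sample.bbox
    cx, cy = click
    return x <= cx <= x + w and y <= cy <= y + h


def accuracy_by_type(samples: list[Sample], clicks: list[tuple[float, float]]) -> dict[str, float]:
    """Compute click accuracy separately for each data_type."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for sample, click in zip(samples, clicks):
        totals[sample.data_type] += 1
        hits[sample.data_type] += int(is_hit(sample, click))
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Made-up samples and predictions, for illustration only.
samples = [
    Sample("a.png", "open settings", (10, 20, 100, 40), "icon", "android"),
    Sample("b.png", "click Sign in", (0, 0, 50, 20), "text", "gitlab"),
]
clicks = [(60, 40), (200, 200)]
print(accuracy_by_type(samples, clicks))
```

Grouping by `data_source` instead of `data_type` would give a per-platform breakdown in the same way. If the stored boxes use corner coordinates (x1, y1, x2, y2) rather than width and height, the comparison in `is_hit` needs to be adjusted accordingly.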

Recent Benchmark Submissions

Task | Model | Paper | Date
Natural Language Visual Grounding | Aria-UI          | Aria-UI: Visual Grounding for GUI … | 2024-12-20
Natural Language Visual Grounding | Aguvis-7B        | Aguvis: Unified Pure Vision Agents … | 2024-12-05
Natural Language Visual Grounding | Aguvis-G-7B      | Aguvis: Unified Pure Vision Agents … | 2024-12-05
Natural Language Visual Grounding | ShowUI-G         | ShowUI: One Vision-Language-Action Model for … | 2024-11-26
Natural Language Visual Grounding | ShowUI           | ShowUI: One Vision-Language-Action Model for … | 2024-11-26
Natural Language Visual Grounding | OS-Atlas-Base-7B | OS-ATLAS: A Foundation Action Model … | 2024-10-30
Natural Language Visual Grounding | OS-Atlas-Base-4B | OS-ATLAS: A Foundation Action Model … | 2024-10-30
Natural Language Visual Grounding | UGround-V1-2B    | Navigating the Digital World as … | 2024-10-07
Natural Language Visual Grounding | UGround          | Navigating the Digital World as … | 2024-10-07
Natural Language Visual Grounding | UGround-V1-7B    | Navigating the Digital World as … | 2024-10-07
Natural Language Visual Grounding | Qwen2-VL-7B      | Qwen2-VL: Enhancing Vision-Language Model's Perception … | 2024-09-18
Natural Language Visual Grounding | OmniParser       | OmniParser for Pure Vision Based … | 2024-08-01
Natural Language Visual Grounding | Qwen-GUI         | GUICourse: From General Vision Language … | 2024-06-17
Natural Language Visual Grounding | Groma            | Groma: Localized Visual Tokenization for … | 2024-04-19
Natural Language Visual Grounding | SeeClick         | SeeClick: Harnessing GUI Grounding for … | 2024-01-17
Natural Language Visual Grounding | CogAgent         | CogAgent: A Visual Language Model … | 2023-12-14
Natural Language Visual Grounding | MiniGPT-v2       | MiniGPT-v2: large language model as … | 2023-10-14
Natural Language Visual Grounding | Qwen-VL          | Qwen-VL: A Versatile Vision-Language Model … | 2023-08-24
