Visual Spatial Reasoning
The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two individual objects in the image, and a vision-language model (VLM) must judge whether the caption correctly describes the image (True) or not (False).
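The task described above is binary classification over caption-image pairs, so a model can be scored by plain accuracy. The following is a minimal sketch of that setup, not the corpus's actual loader or schema: the `CaptionImagePair` class, its field names, the `accuracy` helper, and the toy examples are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CaptionImagePair:
    """One VSR-style example. Field names are illustrative assumptions."""
    image_path: str  # path or identifier of the image
    caption: str     # statement about the spatial relation of two objects
    label: bool      # True if the caption matches the image, else False


def accuracy(
    examples: Iterable[CaptionImagePair],
    predict: Callable[[str, str], bool],
) -> float:
    """Fraction of examples where predict(image_path, caption) equals the label."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(predict(ex.image_path, ex.caption) == ex.label for ex in examples)
    return correct / len(examples)


# Made-up toy data, not taken from the corpus.
examples = [
    CaptionImagePair("img_001.jpg", "The cat is under the table.", True),
    CaptionImagePair("img_002.jpg", "The cup is left of the laptop.", False),
    CaptionImagePair("img_003.jpg", "The dog is in front of the sofa.", True),
    CaptionImagePair("img_004.jpg", "The book is on top of the chair.", False),
]


def always_true(image_path: str, caption: str) -> bool:
    """Trivial baseline standing in for a real VLM: always answers True."""
    return True


print(accuracy(examples, always_true))  # 0.5 on this balanced toy set
```

A real evaluation would replace `always_true` with a function that runs a VLM on the image and caption and thresholds its output into True or False.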
Variants: VSR
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Visual Reasoning | LXMERT | Visual Spatial Reasoning | 2022-04-30 |
Visual Reasoning | ViLT | Visual Spatial Reasoning | 2022-04-30 |
Visual Reasoning | CLIP (finetuned) | Visual Spatial Reasoning | 2022-04-30 |
Visual Reasoning | CLIP (frozen) | Visual Spatial Reasoning | 2022-04-30 |
Visual Reasoning | VisualBERT | Visual Spatial Reasoning | 2022-04-30 |