The RetVQA dataset is a large-scale benchmark for Retrieval-Based Visual Question Answering (RetVQA). RetVQA is more challenging than traditional VQA because a model must first retrieve the relevant images from a pool of candidates before answering a question: the information needed to answer may be spread across multiple images.
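The two-stage nature of the task can be illustrated with a minimal sketch. The functions passed in below (`score_relevance`, `generate_answer`) are hypothetical placeholders standing in for an image-question relevance model and a multi-image answer generator; they are not part of any released RetVQA codebase.

```python
# Minimal sketch of a retrieve-then-answer pipeline for a RetVQA-style task.
# `score_relevance` and `generate_answer` are hypothetical placeholders,
# not functions defined by the dataset or its paper.
from typing import Callable, List, Tuple


def answer_with_retrieval(
    question: str,
    image_pool: List[str],                              # IDs or paths of candidate images
    score_relevance: Callable[[str, str], float],       # (question, image) -> relevance score
    generate_answer: Callable[[str, List[str]], str],   # (question, images) -> free-form answer
    top_k: int = 2,                                     # RetVQA averages ~2 relevant images per question
) -> Tuple[List[str], str]:
    """Stage 1: rank the pool by relevance to the question.
    Stage 2: generate an answer from the top-k retrieved images."""
    ranked = sorted(image_pool, key=lambda img: score_relevance(question, img), reverse=True)
    retrieved = ranked[:top_k]
    return retrieved, generate_answer(question, retrieved)
```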
Here is a detailed summary of the RetVQA dataset:
- It is 20 times larger than the closest dataset in this setting, WebQA.
- It was derived from the Visual Genome dataset, utilising its questions and image annotations.
- It has 418K unique questions and 16,205 unique precise answers.
- The questions are designed to be metadata-independent, meaning they do not rely on information such as captions or tags.
- The questions are divided into five categories:
  - color
  - shape
  - count
  - object-attributes
  - relation-based
- The dataset includes both binary (yes/no) questions and open-ended questions that require a generative answer.
- All answers are free-form and fluent, even for binary questions. For example, a binary question may be "Do the rose and sunflower share the same colour?", and a corresponding answer would be "No, the rose and sunflower do not share the same colour".
- Every question in RetVQA requires reasoning over multiple images to arrive at the answer. This contrasts with datasets like WebQA, where a majority of questions can be answered using a single image.
- The dataset has, on average, two relevant images and 24.5 irrelevant images per question. This makes it more challenging than datasets like ISVQA, where images are homogeneous and no explicit retrieval is needed.
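Concretely, each question can be thought of as pairing a free-form answer with a small set of relevant images and a much larger set of distractors. The record layout below is an illustrative assumption for clarity, not the dataset's actual release schema; the example question and answer are taken from the description above.

```python
# Illustrative record layout for a single RetVQA question.
# Field names are assumptions; they do not reflect the dataset's actual format.
from dataclasses import dataclass
from typing import List


@dataclass
class RetVQASample:
    question: str                 # metadata-independent natural-language question
    answer: str                   # free-form, fluent answer (even for yes/no questions)
    question_type: str            # one of: color, shape, count, object-attributes, relation-based
    relevant_images: List[str]    # ~2 images per question, on average
    irrelevant_images: List[str]  # ~24.5 distractor images per question, on average


sample = RetVQASample(
    question="Do the rose and sunflower share the same colour?",
    answer="No, the rose and sunflower do not share the same colour",
    question_type="color",
    relevant_images=["img_001.jpg", "img_002.jpg"],
    irrelevant_images=[f"distractor_{i:02d}.jpg" for i in range(24)],
)
```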