Visual Information Seeking
In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with common sense knowledge alone. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits the fine-grained knowledge that these models acquired during pre-training.
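To make the task concrete, here is a minimal sketch of what an information-seeking VQA example and an exact-match check might look like. The field names and values are illustrative assumptions, not the official InfoSeek schema.

```python
# Illustrative only: field names and values are assumptions, not the official InfoSeek schema.
from dataclasses import dataclass


@dataclass
class VQAExample:
    image_path: str       # image depicting a specific entity (e.g., a landmark)
    question: str         # information-seeking question about that entity
    answers: list[str]    # acceptable answer strings

# A hypothetical example: the answer requires fine-grained knowledge
# about the depicted entity, not common sense alone.
example = VQAExample(
    image_path="images/example_landmark.jpg",
    question="In which year was this building completed?",
    answers=["1889"],
)


def is_correct(prediction: str, ex: VQAExample) -> bool:
    """Simple normalized exact-match check against the accepted answers."""
    pred = prediction.strip().lower()
    return any(pred == a.strip().lower() for a in ex.answers)


print(is_correct("1889", example))  # True
```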
Variants: InfoSeek
This dataset is used in 2 benchmarks:
| Task | Model | Paper | Date |
|---|---|---|---|
| Visual Question Answering (VQA) | RA-VQAv2 w/ PreFLMR | PreFLMR: Scaling Up Fine-Grained Late-Interaction … | 2024-02-13 |
| Retrieval | PreFLMR | PreFLMR: Scaling Up Fine-Grained Late-Interaction … | 2024-02-13 |
| Visual Question Answering (VQA) | PaLI-X | PaLI-X: On Scaling up a … | 2023-05-29 |
| Visual Question Answering (VQA) | CLIP + PaLM (540B) | Can Pre-trained Vision and Language … | 2023-02-23 |
| Visual Question Answering (VQA) | CLIP + FiD | Can Pre-trained Vision and Language … | 2023-02-23 |
| Visual Question Answering (VQA) | PaLI | Can Pre-trained Vision and Language … | 2023-02-23 |
| Visual Question Answering (VQA) | BLIP2 | BLIP-2: Bootstrapping Language-Image Pre-training with … | 2023-01-30 |