Situated Question Answering in 3D Scenes
SQA3D is a dataset for embodied scene understanding, in which an agent must comprehend the scene it is situated in from a first-person perspective and answer questions about it. The questions are designed to be situated, embodied, and knowledge-intensive. We offer three modalities for representing a 3D scene: a 3D scan, egocentric video, and a bird's-eye-view (BEV) picture.
Variants: SQA3D
This dataset is used in 2 benchmarks:
Task | Model | Paper | Date |
---|---|---|---|
Question Answering | Lexicon3D | Lexicon3D: Probing Visual Foundation Models … | 2024-09-05 |
Question Answering | Situation3D | Situational Awareness Matters in 3D … | 2024-06-11 |
Question Answering | CREMA | CREMA: Generalizable and Efficient Video-Language … | 2024-02-08 |
Question Answering | LM4VisualEncoding | Frozen Transformers in Language Models … | 2023-10-19 |
Referring Expression | Random | SQA3D: Situated Question Answering in … | 2022-10-14 |
Question Answering | ScanQA (w/ auxiliary loss) | SQA3D: Situated Question Answering in … | 2022-10-14 |
Question Answering | ScanQA | SQA3D: Situated Question Answering in … | 2022-10-14 |
Question Answering | MCAN | Deep Modular Co-Attention Networks for … | 2019-06-25 |
Recent papers with results on this dataset: