Showing 5 results | Metric: Accuracy
Rank | Model | Paper | Accuracy (%) | Date | Code |
---|---|---|---|---|---|
1 | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 80.70 | 2023-05-29 | TXH-mercury/VALOR, txh-mercury/vast |
2 | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 78.90 | 2023-04-17 | TXH-mercury/VALOR |
3 | CAD | CAD -- Contextual Multi-modal Alignment for Dynamic AVQA | 78.26 | 2023-10-25 | - |
4 | LAVISH | Vision Transformers are Parameter-Efficient Audio-Visual Learners | 77.08 | 2022-12-15 | GenjiB/LAVISH |
5 | ST-AVQA | Learning to Answer Questions in Dynamic Audio-Visual Scenarios | 71.52 | 2022-03-26 | GeWu-Lab/MUSIC-AVQA |