Showing 7 results. Metric: Accuracy.
Rank | Model | Paper | Accuracy (%) | Date | Code |
---|---|---|---|---|---|
1 | VIOLETv2 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | 97.60 | 2022-09-04 | 📦 tsujuifu/pytorch_empirical-mvm |
2 | HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | 97.40 | 2022-12-30 | - |
3 | VindLU | VindLU: A Recipe for Effective Video-and-Language Pretraining | 95.50 | 2022-12-09 | 📦 klauscc/vindlu |
4 | Clover | Clover: Towards A Unified Video-Language Alignment and Fusion Model | 95.20 | 2022-07-16 | 📦 leeyn-43/clover |
5 | Singularity-temporal | Revealing Single Frame Bias for Video-and-Language Learning | 93.70 | 2022-06-07 | 📦 jayleicn/ClipBERT 📦 jayleicn/singularity |
6 | Norton | Multi-granularity Correspondence Learning from Long-term Noisy Videos | 92.70 | 2024-01-30 | 📦 XLearning-SCU/2024-ICLR-Norton |
7 | Singularity | Revealing Single Frame Bias for Video-and-Language Learning | 92.10 | 2022-06-07 | 📦 jayleicn/ClipBERT 📦 jayleicn/singularity |