| Model | Paper | Score | Date |
|---|---|---|---|
| Mirasol3B | Mirasol3B: A Multimodal Autoregressive model for … | 50.42 | 2023-11-09 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 50.10 | 2023-05-29 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 49.20 | 2023-04-17 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 49.20 | 2023-06-15 |
| MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model f… | 48.50 | 2024-04-08 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 48.00 | 2023-02-01 |
| FrozenBiLM | Zero-Shot Video Question Answering via Frozen Bid… | 47.00 | 2022-06-16 |
| HBI | Video-Text as Game Players: Hierarchical Banzhaf … | 46.20 | 2023-03-25 |
| EMCL-Net | Expectation-Maximization Contrastive Learning for… | 45.80 | 2022-11-21 |
| VindLU | VindLU: A Recipe for Effective Video-and-Language… | 44.60 | 2022-12-09 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 44.50 | 2022-09-04 |
| Singularity-temporal | Revealing Single Frame Bias for Video-and-Languag… | 43.90 | 2022-06-07 |
| Singularity | Revealing Single Frame Bias for Video-and-Languag… | 43.50 | 2022-06-07 |
| FrozenBiLM (0-shot) | Zero-Shot Video Question Answering via Frozen Bid… | 16.70 | 2022-06-16 |