| Model | Paper | Score | Date |
|---|---|---|---|
| InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multi… | 33.80 | 2024-03-22 |
| InternVideo2-1B | InternVideo2: Scaling Foundation Models for Multi… | 32.00 | 2024-03-22 |
| VAST, HowToCaption-finetuned | HowToCaption: Prompting LLMs to Transform Video A… | 27.70 | 2023-10-07 |
| UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Vide… | 25.20 | 2023-03-28 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 24.10 | 2023-02-01 |
| BT-Adapter | BT-Adapter: Video Conversation is Feasible Withou… | 19.50 | 2023-09-27 |
| HiTeA-17M | HiTeA: Hierarchical Temporal-Aware Video-Language… | 18.30 | 2022-12-30 |
| InternVideo | InternVideo: General Video Foundation Models via … | 17.60 | 2022-12-06 |
| HowToCaption | HowToCaption: Prompting LLMs to Transform Video A… | 17.30 | 2023-10-07 |
| Yatai Ji et al. | Seeing What You Miss: Vision-Language Pre-trainin… | 17.20 | 2022-11-24 |
| HiTeA-5M | HiTeA: Hierarchical Temporal-Aware Video-Language… | 15.50 | 2022-12-30 |
| CLIP4Clip | CLIP4Clip: An Empirical Study of CLIP for End to … | 15.10 | 2021-04-18 |
| Clover | Clover: Towards A Unified Video-Language Alignmen… | 14.70 | 2022-07-16 |
| Y. Ge et al. | Bridging Video-text Retrieval with Multiple Choic… | 12.20 | 2022-01-13 |
| MILES | MILES: Visual BERT Pre-training with Injected Lan… | 11.10 | 2022-04-26 |
| SSML | Noise Estimation Using Density Estimation for Sel… | 4.20 | 2020-03-06 |