| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multi… | 59.30 | 2024-03-22 |
| InternVideo2-1B | InternVideo2: Scaling Foundation Models for Multi… | 58.10 | 2024-03-22 |
| VAST, HowToCaption-finetuned | HowToCaption: Prompting LLMs to Transform Video A… | 54.80 | 2023-10-07 |
| LanguageBind (ViT-L/14) | LanguageBind: Extending Video-Language Pretrainin… | 54.10 | 2023-10-03 |
| LanguageBind (ViT-H/14) | LanguageBind: Extending Video-Language Pretrainin… | 53.90 | 2023-10-03 |
| vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging for Light-w… | 50.00 | 2024-03-20 |
| UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Vide… | 49.00 | 2023-03-28 |
| HowToCaption | HowToCaption: Prompting LLMs to Transform Video A… | 44.50 | 2023-10-07 |
| MILES | MILES: Visual BERT Pre-training with Injected Lan… | 44.40 | 2022-04-26 |
| Y. Ge et al. | Bridging Video-text Retrieval with Multiple Choic… | 43.60 | 2022-01-13 |
| InternVideo | InternVideo: General Video Foundation Models via … | 43.40 | 2022-12-06 |
| CLIP4Clip | CLIP4Clip: An Empirical Study of CLIP for End to … | 38.50 | 2021-04-18 |
| LaT | LaT: Latent Translation with Cycle-Consistency fo… | 36.90 | 2022-07-11 |
| SSML | Noise Estimation Using Density Estimation for Sel… | 13.66 | 2020-03-06 |