IMAGEBIND: One Embedding Space To Bind Them All

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra (FAIR, Meta AI) (2023)

Paper Information
arXiv ID: 2305.05665
Venue: Computer Vision and Pattern Recognition
Domain: artificial intelligence, computer vision, multimedia, machine learning
SOTA Claim: Yes
Reproducibility: 7/10

Abstract

Abstract not available.

Summary

The paper introduces IMAGEBIND, an approach for learning a joint embedding space across six modalities: images, text, audio, depth, thermal, and IMU data. The authors show that image-paired data alone suffices to bind the modalities together: each non-image modality is aligned only to images, and alignment between the other modalities emerges without any training pairs connecting them directly. This removes the need for multimodal datasets in which all modalities are simultaneously present. The resulting embeddings support cross-modal retrieval, strong emergent zero-shot and few-shot recognition, and upgrades to existing generation and detection models through the unified embedding space, without retraining those models for specific tasks. The authors also discuss limitations and call for further research on improving the model and on new evaluation benchmarks for emergent multimodal abilities.
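
The cross-modal retrieval and zero-shot recognition described above reduce to nearest-neighbor search in the shared embedding space. A minimal sketch of that idea (function names, dimensions, and the random "embeddings" are illustrative, not from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(query_emb, class_text_embs):
    # Rank class text embeddings by cosine similarity to a query embedding
    # from any bound modality (audio, depth, IMU, ...); return the best class index.
    sims = l2_normalize(class_text_embs) @ l2_normalize(query_emb)
    return int(np.argmax(sims))

# Toy example: 3 "text prompt" embeddings and a query near class 1.
rng = np.random.default_rng(0)
d = 8
class_embs = l2_normalize(rng.normal(size=(3, d)))
query = class_embs[1] + 0.05 * rng.normal(size=d)  # e.g. an audio clip's embedding
print(zero_shot_classify(query, class_embs))       # → 1
```

Because all modalities land in one space, the same routine serves audio-to-text, depth-to-image, or any other retrieval direction without modality-specific heads.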

Methods

This paper employs the following methods:

  • Embedding-Space Arithmetic
  • Contrastive Learning
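
As a concrete illustration of the contrastive objective, each non-image modality is paired with images and trained with a symmetric InfoNCE-style loss; matched pairs in a batch are positives and all other pairings are negatives. A hedged sketch (the temperature value and batch handling here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def info_nce(img_embs, other_embs, temperature=0.07):
    # Symmetric InfoNCE: matched (image, other-modality) pairs lie on the
    # diagonal of the similarity matrix; off-diagonal pairs are negatives.
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    oth = other_embs / np.linalg.norm(other_embs, axis=1, keepdims=True)
    logits = img @ oth.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()     # diagonal entries are the targets

    # Average the image→other and other→image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Embedding-space arithmetic, the other listed method, then follows directly: summing the normalized embeddings of inputs from two modalities (e.g. an image and an audio clip) gives a query vector whose nearest neighbors combine both concepts.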

Models Used

  • DALLE-2
  • CLIP

Datasets

The following datasets were used in this research:

  • Audioset
  • ESC
  • Clotho
  • AudioCaps
  • VGGSound
  • SUN RGB-D
  • SUN Depth-only
  • NYU-v2 Depth-only
  • LLVIP
  • Ego4D

Evaluation Metrics

  • mAP
  • Accuracy
  • Recall

Results

  • State-of-the-art performance in emergent zero-shot recognition tasks
  • Strong few-shot recognition results

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

multimodal embeddings, zero-shot learning, contrastive learning, vision-language models, audio-visual data, cross-modal retrieval, generative models
