MultiSubs

Name: MultiSubs
Published: 2021-06-30
License: Creative Commons Attribution 4.0 International

MultiSubs: A Large-scale Multimodal and Multilingual Dataset

Dataset Information

Modalities

Images, Texts

Languages

English, French, Spanish, German, Portuguese

Introduced

2021

License

Creative Commons Attribution 4.0 International

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments within the subtitles (visually salient nouns in this release) with web images, where the word sense of each fragment has been disambiguated using a cross-lingual approach. We have introduced a fill-in-the-blank task and a lexical translation task to demonstrate the utility of the dataset. Please refer to our paper for a more detailed description of the dataset and tasks. MultiSubs will benefit research on the visual grounding of words, especially in the context of free-form sentences.

Josiah Wang, Pranava Madhyastha, Josiel Figueiredo, Chiraag Lala, Lucia Specia (2021). MultiSubs: A Large-scale Multimodal and Multilingual Dataset. CoRR, abs/2103.01910. Available at: https://arxiv.org/abs/2103.01910
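
To make the fill-in-the-blank task more concrete, here is a minimal Python sketch of how one instance might be built from a subtitle sentence. The `FillInBlankInstance` type, the `make_instance` helper, the `<blank>` placeholder, the example sentence, and the image path are all hypothetical illustrations. They are not the dataset's actual file format or the authors' tooling.

```python
from dataclasses import dataclass
from typing import Optional

BLANK = "<blank>"  # hypothetical placeholder token, not taken from the dataset


@dataclass
class FillInBlankInstance:
    context: str          # sentence with the fragment masked out
    answer: str           # the masked fragment (e.g. a visually salient noun)
    image: Optional[str]  # path or URL of an associated image, if any


def make_instance(sentence: str, fragment: str,
                  image: Optional[str] = None) -> FillInBlankInstance:
    """Mask the first whole-word occurrence of `fragment` in `sentence`."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        # Compare case-insensitively, ignoring trailing punctuation.
        core = tok.rstrip(".,!?;:")
        if core.lower() == fragment.lower():
            # Keep any trailing punctuation attached to the blank.
            tokens[i] = BLANK + tok[len(core):]
            return FillInBlankInstance(" ".join(tokens), core, image)
    raise ValueError(f"fragment {fragment!r} not found in sentence")


# Made-up example sentence and image path, for illustration only.
example = make_instance("He left the trunk in the car.", "trunk",
                        "images/trunk_01.jpg")
print(example.context)
print(example.answer)
```

A model for the task would receive `context` (and optionally `image`) and be scored on whether it recovers `answer`.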

Variants: MultiSubs, MultiSubs English-Spanish, MultiSubs English-Portuguese, MultiSubs English-French, MultiSubs English-German

Associated Benchmarks

This dataset is used in 1 benchmark:

Multimodal Text Prediction - Metrics: Accuracy, Word similarity
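
The two metrics listed above could be computed along the following lines. This is a rough sketch, not the benchmark's official scoring script. It assumes that accuracy means a case-insensitive exact match, and it averages a caller-supplied similarity function because the similarity measure itself is not specified on this page. The function names and the `toy_similarity` placeholder are assumptions made for illustration.

```python
from typing import Callable, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match the reference word (case-insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.lower() == r.lower() for p, r in zip(predictions, references))
    return hits / len(references)


def mean_word_similarity(predictions: Sequence[str],
                         references: Sequence[str],
                         similarity: Callable[[str, str], float]) -> float:
    """Average of similarity(prediction, reference) over all instances."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    total = sum(similarity(p, r) for p, r in zip(predictions, references))
    return total / len(references)


def toy_similarity(a: str, b: str) -> float:
    """Placeholder measure: 1.0 for identical words, otherwise 0.0.

    Swap in a real lexical or embedding-based similarity in practice.
    """
    return 1.0 if a.lower() == b.lower() else 0.0


# Made-up predictions and gold answers, for illustration only.
preds = ["trunk", "dog", "boat"]
refs = ["trunk", "cat", "ship"]
print(accuracy(preds, refs))
print(mean_word_similarity(preds, refs, toy_similarity))
```

Passing the similarity function as a parameter keeps the sketch independent of any particular lexical resource. With a graded measure, a near miss such as "boat" for "ship" would earn partial credit that exact-match accuracy does not give.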

Recent Benchmark Submissions

Task	Model	Paper	Date
Multimodal Text Prediction	9-gram LM with back-off	MultiSubs: A Large-scale Multimodal and …	2021-03-02

Research Papers

Recent papers with results on this dataset:

MultiSubs: A Large-scale Multimodal and Multilingual Dataset (2021)
