OpenSubtitles

Dataset Information
Languages: Russian
License: Unknown
Homepage

Overview

OpenSubtitles is a collection of multilingual parallel corpora compiled from a large database of movie and TV subtitles. It includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
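
To make the term "bitext" concrete, the sketch below pairs up the two sides of a sentence-aligned corpus. It assumes a line-aligned plain-text layout (line N on the source side translates line N on the target side), which is a common distribution format for parallel corpora but is not stated on this page. The function name `read_bitext` and the sample sentences are hypothetical and are not taken from the dataset.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_bitext(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned sides.

    Assumption: both sides have the same number of lines and line N of one
    side corresponds to line N of the other.
    """
    missing = object()  # sentinel marking the shorter side running out
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines, fillvalue=missing), start=1):
        if src is missing or tgt is missing:
            raise ValueError(f"bitext sides differ in length at line {n}")
        yield src.rstrip("\n"), tgt.rstrip("\n")


# Illustrative in-memory sample (hypothetical sentences, not corpus data).
sample_en = ["Hello.\n", "Where are you going?\n"]
sample_ru = ["Привет.\n", "Куда ты идёшь?\n"]

pairs = list(read_bitext(sample_en, sample_ru))
```

With real data, the two arguments could be open file handles for the source-language and target-language files of one language pair. Raising on a length mismatch is a deliberate choice here, since a silently truncated side would shift every later alignment.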

Variants: OpenSubtitles

Associated Benchmarks

This dataset is used in 2 benchmarks:

Recent Benchmark Submissions

Task                     Model                Paper                                       Date
Machine Translation      Fine tuned MarianMT  Crossing Language Borders: A Pipeline …     2025-01-03
Language Identification  Apple bi-LSTM        A reproduction of Apple's bi-directional …  2021-02-11

Research Papers

Recent papers with results on this dataset: