WikiText-2

Dataset Information
Modalities
Texts
Languages
English
Introduced
2016
License
Creative Commons Attribution-ShareAlike

Overview

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also has a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.
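To make the vocabulary point concrete, the following is a minimal sketch of how PTB-style normalization shrinks the set of distinct tokens. The function `ptb_style_normalize` and the sample sentence are hypothetical illustrations, not part of either dataset's tooling, and the rules (lowercasing, mapping numbers to `N`, dropping punctuation) only approximate the PTB preprocessing.

```python
import re

def ptb_style_normalize(tokens):
    """Hypothetical approximation of PTB-style preprocessing."""
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d+([.,]\d+)*", tok):
            out.append("N")          # every number collapses to one placeholder
        elif re.fullmatch(r"\W+", tok):
            continue                 # punctuation-only tokens are dropped
        else:
            out.append(tok.lower())  # original case is discarded
    return out

# Made-up sample text, whitespace-tokenized for simplicity.
sample = "The game sold 1,200 copies in 1995 . The sequel sold 3,400 copies in 1997 ."
raw_tokens = sample.split()
norm_tokens = ptb_style_normalize(raw_tokens)

raw_vocab = set(raw_tokens)    # case, punctuation, and numbers kept (WikiText-like)
norm_vocab = set(norm_tokens)  # after PTB-style normalization

print(len(raw_vocab), len(norm_vocab))  # 11 7
```

Even on two short sentences the normalized vocabulary is noticeably smaller, because distinct numbers and cased forms merge into shared tokens.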

Source: The WikiText Long Term Dependency Language Modeling Dataset
Image Source: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

Variants: WikiText-103, WikiText-2, wikitext wikitext-2-raw-v1

Associated Benchmarks

This dataset is used in 1 benchmark.

Recent Benchmark Submissions

Task | Model | Paper | Date
Language Modelling | Ensemble of All | Advancing State of the Art … | 2023-11-28
Language Modelling | OPT-175B | SparseGPT: Massive Language Models Can … | 2023-01-02
Language Modelling | SparseGPT (175B, 4:8 Sparsity) | SparseGPT: Massive Language Models Can … | 2023-01-02
Language Modelling | SparseGPT (175B, 2:4 Sparsity) | SparseGPT: Massive Language Models Can … | 2023-01-02
Language Modelling | SparseGPT (175B, 50% Sparsity) | SparseGPT: Massive Language Models Can … | 2023-01-02
Language Modelling | OPT-175B (50% Sparsity) | SparseGPT: Massive Language Models Can … | 2023-01-02
Language Modelling | EGRU | Efficient recurrent architectures through activity … | 2022-06-13
Language Modelling | GPT-2 (fine-tuned) | Hydra: A System for Large … | 2021-10-16
Language Modelling | AWD-FWM Schlag et al. (2020) | Learning Associative Inference Using Fast … | 2020-11-16
Language Modelling | AWD-LSTM + ATOI | Alleviating Sequence Information Loss with … | 2019-09-18
Language Modelling | Mogrifier LSTM + dynamic eval | Mogrifier LSTM | 2019-09-04
Language Modelling | Mogrifier LSTM | Mogrifier LSTM | 2019-09-04
Language Modelling | adversarial + AWD-LSTM-MoS + dynamic eval | Improving Neural Language Modeling via … | 2019-06-10
Language Modelling | AWD-LSTM-DRILL | Deep Residual Output Layers for … | 2019-05-14
Language Modelling | AWD-LSTM-DRILL + dynamic eval | Deep Residual Output Layers for … | 2019-05-14
Language Modelling | BERT-Large-CAS | Language Models with Transformers | 2019-04-20
Language Modelling | AWD-LSTM-MoS + Partial Shuffle | Partially Shuffling the Training Data … | 2019-03-11
Language Modelling | AWD-LSTM-DOC + Partial Shuffle | Partially Shuffling the Training Data … | 2019-03-11
Language Modelling | FRAGE + AWD-LSTM-MoS + dynamic eval | FRAGE: Frequency-Agnostic Word Representation | 2018-09-18
Language Modelling | AWD-LSTM-DOC x5 | Direct Output Connection for a … | 2018-08-30

Research Papers

Recent papers with results on this dataset: