WikiText-103

Dataset Information
Modalities: Texts
Languages: English
Introduced: 2016
License: Creative Commons Attribution-ShareAlike

Overview

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.
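The size comparison above can be checked with a short calculation. The sketch below assumes the training-set token counts usually quoted from the WikiText paper's statistics table; these figures do not appear on this page, so treat them as assumptions and verify them against the original source. The names TRAIN_TOKENS and size_ratio are illustrative only.

```python
# Training-set token counts as commonly quoted from the WikiText paper's
# statistics table. ASSUMPTION: these figures are not stated on this page;
# verify them against the original source before relying on them.
TRAIN_TOKENS = {
    "PTB": 887_521,
    "WikiText-2": 2_088_628,
    "WikiText-103": 103_227_021,
}


def size_ratio(dataset: str, baseline: str = "PTB") -> float:
    """Return how many times larger `dataset` is than `baseline`, by training tokens."""
    return TRAIN_TOKENS[dataset] / TRAIN_TOKENS[baseline]


if __name__ == "__main__":
    for name in ("WikiText-2", "WikiText-103"):
        # Print each ratio rounded to one decimal place.
        print(f"{name}: {size_ratio(name):.1f}x the size of PTB")
```

Under these assumed counts, both ratios clear the thresholds stated in the overview (more than 2 and more than 110 respectively).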

Source: The WikiText Long Term Dependency Language Modeling Dataset
Image Source: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

Variants: WikiText-103

Associated Benchmarks

This dataset is used in 2 benchmarks.

Recent Benchmark Submissions

Task | Model | Paper | Date
Language Modelling | Ensemble of All | Advancing State of the Art … | 2023-11-28
Language Modelling | Skip Cross-Head Transformer-XL | Memory-efficient Stochastic methods for Memory-based … | 2023-11-14
Language Modelling | GateLoop (125M) | GateLoop: Fully Data-Controlled Linear Recurrence … | 2023-11-03
Language Modelling | Transformer+SSA | The Information Pathways Hypothesis: Transformers … | 2023-06-02
Language Modelling | Transformer+SSA+Self-ensemble | The Information Pathways Hypothesis: Transformers … | 2023-06-02
Language Modelling | Primal.+Trans. | Primal-Attention: Self-attention through Asymmetric Kernel … | 2023-05-31
Language Modelling | Hyena-3 | Hyena Hierarchy: Towards Larger Convolutional … | 2023-02-21
Language Modelling | Hyena-3-slim | Hyena Hierarchy: Towards Larger Convolutional … | 2023-02-21
Language Modelling | Hybrid H3 125M | Hungry Hungry Hippos: Towards Language … | 2022-12-28
Language Modelling | Hybrid H3 (125M) | Hungry Hungry Hippos: Towards Language … | 2022-12-28
Language Modelling | Hybrid H3 (1.3B) | Hungry Hungry Hippos: Towards Language … | 2022-12-28
Language Modelling | Hybrid H3 (355M) | Hungry Hungry Hippos: Towards Language … | 2022-12-28
Language Modelling | Hybrid H3 (2.7B) | Hungry Hungry Hippos: Towards Language … | 2022-12-28
Language Modelling | kNN-LM w/ Adaptive Coefficient | You can't pick your neighbors, … | 2022-10-28
Language Modelling | Mega | Mega: Moving Average Equipped Gated … | 2022-09-21
Language Modelling | Perceiver AR 358M | General-purpose, long-context autoregressive modeling with … | 2022-02-15
Language Modelling | RETRO (7.5B) | Improving language models by retrieving … | 2021-12-08
Language Modelling | S4 | Efficiently Modeling Long Sequences with … | 2021-10-31
Language Modelling | ∞-former (initialized GPT-2 Small) | ∞-former: Infinite Memory Transformer | 2021-09-01
Language Modelling | ∞-former (Sticky memories + initialized GPT-2 Small) | ∞-former: Infinite Memory Transformer | 2021-09-01

Research Papers

Recent papers with results on this dataset: