WikiText-103

Name: WikiText-103
Published: 2016-09-26
License: CC BY-SA 3.0

Dataset Information

Modalities: Texts
Languages: English
Introduced: 2016
License: CC BY-SA 3.0
Homepage: Official Website

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.

Source: The WikiText Long Term Dependency Language Modeling Dataset
Image Source: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

Variants: WikiText-103
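
The size comparison in the overview can be checked with a little arithmetic. The sketch below is illustrative only: the training-set token counts are assumed from the source announcement's statistics table and are not stated on this page, and the names `TRAIN_TOKENS` and `size_ratio` are hypothetical.

```python
# Training-set token counts. ASSUMPTION: taken from the statistics table in
# the WikiText announcement; they do not appear on this page.
TRAIN_TOKENS = {
    "PTB": 929_590,
    "WikiText-2": 2_088_628,
    "WikiText-103": 103_227_021,
}


def size_ratio(name, baseline="PTB"):
    """Return how many times larger `name` is than `baseline`, by token count."""
    return TRAIN_TOKENS[name] / TRAIN_TOKENS[baseline]


if __name__ == "__main__":
    for name in ("WikiText-2", "WikiText-103"):
        print(f"{name}: {size_ratio(name):.1f}x the size of PTB")
```

Under these assumed counts, the ratios come out at a little over 2 and a little over 110, in line with the figures quoted in the overview.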

Associated Benchmarks

This dataset is used in 2 benchmarks:

Text Generation - Metrics: Perplexity
Language Modelling - Metrics: Test perplexity, Validation perplexity, Number of params
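
Both benchmarks rank models by perplexity. As a minimal sketch of how that metric is conventionally defined (the exponential of the average negative log-likelihood per token), the snippet below assumes the model's per-token natural-log probabilities are already available. The function name `perplexity` and its input format are hypothetical, and individual submissions may differ in tokenization and evaluation details.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model assigned
    to each target token in the evaluation text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


if __name__ == "__main__":
    # A model that spreads probability uniformly over 4 choices at every
    # step has a perplexity of about 4.
    uniform = [math.log(1 / 4)] * 8
    print(perplexity(uniform))
```

Lower is better: a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens at each step.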

Recent Benchmark Submissions

Task	Model	Paper	Date
Language Modelling	Ensemble of All	Advancing State of the Art …	2023-11-28
Language Modelling	Skip Cross-Head Transformer-XL	Memory-efficient Stochastic methods for Memory-based …	2023-11-14
Language Modelling	GateLoop (125M)	GateLoop: Fully Data-Controlled Linear Recurrence …	2023-11-03
Language Modelling	Transformer+SSA	The Information Pathways Hypothesis: Transformers …	2023-06-02
Language Modelling	Transformer+SSA+Self-ensemble	The Information Pathways Hypothesis: Transformers …	2023-06-02
Language Modelling	Primal.+Trans.	Primal-Attention: Self-attention through Asymmetric Kernel …	2023-05-31
Language Modelling	Hyena-3	Hyena Hierarchy: Towards Larger Convolutional …	2023-02-21
Language Modelling	Hyena-3-slim	Hyena Hierarchy: Towards Larger Convolutional …	2023-02-21
Language Modelling	Hybrid H3 125M	Hungry Hungry Hippos: Towards Language …	2022-12-28
Language Modelling	Hybrid H3 (125M)	Hungry Hungry Hippos: Towards Language …	2022-12-28
Language Modelling	Hybrid H3 (1.3B)	Hungry Hungry Hippos: Towards Language …	2022-12-28
Language Modelling	Hybrid H3 (355M)	Hungry Hungry Hippos: Towards Language …	2022-12-28
Language Modelling	Hybrid H3 (2.7B)	Hungry Hungry Hippos: Towards Language …	2022-12-28
Language Modelling	kNN-LM w/ Adaptive Coefficient	You can't pick your neighbors, …	2022-10-28
Language Modelling	Mega	Mega: Moving Average Equipped Gated …	2022-09-21
Language Modelling	Perceiver AR 358M	General-purpose, long-context autoregressive modeling with …	2022-02-15
Language Modelling	RETRO (7.5B)	Improving language models by retrieving …	2021-12-08
Language Modelling	S4	Efficiently Modeling Long Sequences with …	2021-10-31
Language Modelling	∞-former (initialized GPT-2 Small)	∞-former: Infinite Memory Transformer	2021-09-01
Language Modelling	∞-former (Sticky memories + initialized GPT-2 Small)	∞-former: Infinite Memory Transformer	2021-09-01

Research Papers

Recent papers with results on this dataset:

External Links:

WikiText-103
