WikiText-2

Language Modelling Benchmark

Performance Over Time

Showing 34 results. Metric: test perplexity (lower is better).
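Test perplexity is the exponential of the model's average per-token negative log-likelihood on the test set. Below is a minimal sketch of that computation, assuming PyTorch and a pre-computed tensor of logits; the function name and shapes are illustrative, not taken from any paper listed here.

```python
import math

import torch
import torch.nn.functional as F


def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """exp(mean token-level cross-entropy) over the evaluation set.

    logits:  (num_tokens, vocab_size) unnormalized next-token scores
    targets: (num_tokens,) gold next-token ids
    """
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return math.exp(nll.item())
```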

Top Performing Models

| Rank | Model | Paper | Test perplexity | Date | Code |
|------|-------|-------|-----------------|------|------|
| 1 | OPT-175B (50% Sparsity) | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 234.77 | 2023-01-02 | nvidia/tensorrt-model-optimizer, ist-daslab/sparsegpt, nvlabs/maskllm |
| 2 | Grave et al. (2016), LSTM | Improving Neural Language Models with a Continuous Cache | 99.30 | 2016-12-13 | dmlc/gluon-nlp, salesforce/awd-lstm-lm, uclanlp/NamedEntityLanguageModel |
| 3 | Inan et al. (2016), Variational LSTM (tied, h=650) | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | 87.70 | 2016-11-04 | JianGoForIt/YellowFin_Pytorch, rdspring1/PyTorch_GBW_LM, floydhub/word-language-model, InnerPeace-Wu/im2p-tensorflow, Ravoxsg/Word-level-language-modeling |
| 4 | Inan et al. (2016), Variational LSTM (tied, h=650) + augmented loss | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | 87.00 | 2016-11-04 | JianGoForIt/YellowFin_Pytorch, rdspring1/PyTorch_GBW_LM, floydhub/word-language-model, InnerPeace-Wu/im2p-tensorflow, Ravoxsg/Word-level-language-modeling |
| 5 | Grave et al. (2016), LSTM + continuous cache pointer | Improving Neural Language Models with a Continuous Cache | 68.90 | 2016-12-13 | dmlc/gluon-nlp, salesforce/awd-lstm-lm, uclanlp/NamedEntityLanguageModel |
| 6 | EGRU | Efficient recurrent architectures through activity sparsity and sparse back-propagation through time | 68.90 | 2022-06-13 | khaleelkhan/evnn |
| 7 | Melis et al. (2017), 1-layer LSTM (tied) | On the State of the Art of Evaluation in Neural Language Models | 65.90 | 2017-07-18 | deepmind/lamb |
| 8 | AWD-LSTM | Regularizing and Optimizing LSTM Language Models | 65.80 | 2017-08-07 | google-research/google-research, fastai/fastai, dmlc/gluon-nlp |
| 9 | AWD-LSTM + ATOI | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | 64.73 | 2019-09-18 | nkcr/overlap-ml |
| 10 | AWD-LSTM 3-layer with Fraternal dropout | Fraternal Dropout | 64.10 | 2017-10-31 | kondiz/fraternal-dropout |
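To reproduce a figure like those in the table, the usual recipe is a sliding-window evaluation over the concatenated test split. Below is a hedged sketch using the Hugging Face `datasets` and `transformers` libraries, with GPT-2 as a stand-in model (no model from the table is implied); the stride value and the blank-line join are common conventions, not requirements, and the per-window token accounting is an approximation.

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# WikiText-2 test split, concatenated into one long token stream.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = model.config.n_positions  # 1024 for GPT-2
stride = 512
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # tokens newly scored in this window
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # context-only tokens are not scored
    with torch.no_grad():
        out = model(input_ids, labels=target_ids)
    nlls.append(out.loss * trg_len)  # loss is a per-token mean; undo it
    prev_end = end
    if end == seq_len:
        break

print(f"test perplexity: {math.exp(torch.stack(nlls).sum() / prev_end):.2f}")
```

Note that perplexity is only comparable across rows when the tokenization matches: the word-level numbers in the table are not directly comparable to subword-level evaluations like the GPT-2 sketch above.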

All Papers (34)

- Fraternal Dropout (2017): AWD-LSTM 3-layer with Fraternal dropout
- Improved Language Modeling by Decoding the Past (2018): Past Decode Reg. + AWD-LSTM-MoS + dyn. eval.