| Model | Paper | Perplexity | Date |
|---|---|---|---|
| Decay RNN | How much complexity does an RNN architecture need… | 76.67 | 2020-05-17 |
| GRU | How much complexity does an RNN architecture need… | 53.78 | 2020-05-17 |
| LSTM | How much complexity does an RNN architecture need… | 52.73 | 2020-05-17 |
| LSTM | Improving Neural Language Models with a Continuou… | 48.70 | 2016-12-13 |
| TCN | An Empirical Evaluation of Generic Convolutional … | 45.19 | 2018-03-04 |
| GCNN-8 | Language Modeling with Gated Convolutional Networ… | 44.90 | 2016-12-23 |
| Neural cache model (size = 100) | Improving Neural Language Models with a Continuou… | 44.80 | 2016-12-13 |
| Neural cache model (size = 2,000) | Improving Neural Language Models with a Continuou… | 40.80 | 2016-12-13 |
| GCNN-8 | Language Modeling with Gated Convolutional Networ… | 37.20 | 2016-12-23 |
| LSTM | Fast Parametric Learning with Activation Memoriza… | 36.40 | 2018-03-27 |
| LSTM (Hebbian) | Fast Parametric Learning with Activation Memoriza… | 34.30 | 2018-03-27 |
| 4 layer QRNN | An Analysis of Neural Language Modeling at Multip… | 33.00 | 2018-03-22 |
| AWD-LSTM-MoS + ATOI | Alleviating Sequence Information Loss with Data O… | 32.85 | 2019-09-18 |
| DEQ-Transformer (small) | Deep Equilibrium Models | 32.40 | 2019-09-03 |
| LSTM (RMC) | Relational recurrent neural networks | 31.60 | 2018-06-05 |
| Primal.+Trans. | Primal-Attention: Self-attention through Asymmetr… | 31.00 | 2023-05-31 |
| Rfa-Gate-Gaussian-Stateful (Small) | Random Feature Attention | 30.50 | 2021-03-03 |
| LSTM (Hebbian, Cache) | Fast Parametric Learning with Activation Memoriza… | 29.70 | 2018-03-27 |
| LSTM (Hebbian, Cache, MbPA) | Fast Parametric Learning with Activation Memoriza… | 29.20 | 2018-03-27 |
| Trellis Network | Trellis Networks for Sequence Modeling | 29.19 | 2018-10-15 |
| DEQ-TrellisNet | Deep Equilibrium Models | 29.00 | 2019-09-03 |
| AdvSoft (+ 4 layer QRNN + dynamic eval) | Improving Neural Language Modeling via Adversaria… | 28.00 | 2019-06-10 |
| Performer 125M | Rethinking Attention with Performers | 26.80 | 2020-09-30 |
| Reformer 125M | Reformer: The Efficient Transformer | 26.00 | 2020-01-13 |
| FNetAR Medium | FNetAR: Mixing Tokens with Autoregressive Fourier… | 25.81 | 2021-07-22 |
| Linear Attention 125M | Transformers are RNNs: Fast Autoregressive Transf… | 25.60 | 2020-06-29 |
| Transformer-N | Revisiting Simple Neural Probabilistic Language M… | 25.20 | 2021-04-08 |
| ∞-former (Sticky memories) | $\infty$-former: Infinite Memory Transformer | 24.22 | 2021-09-01 |
| DeLighT | DeLighT: Deep and Light-weight Transformer | 24.14 | 2020-08-03 |
| Transformer-XL Standard | Transformer-XL: Attentive Language Models Beyond … | 24.00 | 2019-01-09 |
| Hybrid H3 (125M) | Hungry Hungry Hippos: Towards Language Modeling w… | 23.70 | 2022-12-28 |
| Rfa-Gate-Gaussian-Stateful (Big) | Random Feature Attention | 23.50 | 2021-03-03 |
| TaLK Convolutions | Time-aware Large Kernel Convolutions | 23.30 | 2020-02-08 |
| DEQ-Transformer (medium, adaptive embed) | Deep Equilibrium Models | 23.20 | 2019-09-03 |
| Skip Cross-Head Transformer-XL | Memory-efficient Stochastic methods for Memory-ba… | 22.91 | 2023-11-14 |
| PAR Transformer Base | Pay Attention when Required | 22.70 | 2020-09-09 |
| Feedback Transformer (4 layers) | Addressing Some Limitations of Transformers with … | 22.40 | 2020-02-21 |
| S4 | Efficiently Modeling Long Sequences with Structur… | 21.28 | 2021-10-31 |
| All-attention network (36 layers) | Augmenting Self-attention with Persistent Memory | 20.60 | 2019-07-02 |
| BERT-Large-CAS | Language Models with Transformers | 20.40 | 2019-04-20 |
| T2R + Pretrain | Finetuning Pretrained Transformers into RNNs | 19.60 | 2021-03-24 |
| Transformer (Adaptive inputs) | On the adequacy of untuned warmup for adaptive op… | 19.50 | 2019-10-09 |
| Transformer (Adaptive inputs) | Adaptive Input Representations for Neural Languag… | 18.70 | 2018-09-28 |
| Hyena-3 | Hyena Hierarchy: Towards Larger Convolutional Lan… | 18.60 | 2023-02-21 |
| Hyena-3-slim | Hyena Hierarchy: Towards Larger Convolutional Lan… | 18.50 | 2023-02-21 |
| Hybrid H3 125M | Hungry Hungry Hippos: Towards Language Modeling w… | 18.50 | 2022-12-28 |
| PAR Transformer Large | Pay Attention when Required | 18.40 | 2020-09-09 |
| Perceiver AR 358M | General-purpose, long-context autoregressive mode… | 18.40 | 2022-02-15 |
| SRU++ Base | When Attention Meets Fast Recurrence: Training La… | 18.30 | 2021-02-24 |
| Transformer-XL Large | Transformer-XL: Attentive Language Models Beyond … | 18.30 | 2019-01-09 |
| Feedback Transformer (8 layers) | Addressing Some Limitations of Transformers with … | 18.20 | 2020-02-21 |
| Shortformer | Shortformer: Better Language Modeling using Short… | 18.15 | 2020-12-31 |
| Mega | Mega: Moving Average Equipped Gated Attention | 18.07 | 2022-09-21 |
| DIFFQ (λ=1, g=16) | Differentiable Model Compression via Pseudo Quant… | 18.00 | 2021-04-20 |
| Sandwich Transformer | Improving Transformer Models by Reordering their … | 17.96 | 2019-11-10 |
| Transformer+SSA | The Information Pathways Hypothesis: Transformers… | 17.60 | 2023-06-02 |
| Staged Training | Shortformer: Better Language Modeling using Short… | 17.56 | 2020-12-31 |
| Transformer-XL Large + Phrase Induction | Improving Neural Language Models by Segmenting, A… | 17.40 | 2019-06-04 |
| Transformer+SSA+Self-ensemble | The Information Pathways Hypothesis: Transformers… | 17.18 | 2023-06-02 |
| Compressive Transformer (18L, M=1024) | Compressive Transformers for Long-Range Sequence … | 17.10 | 2019-11-13 |
| SRU++ Large | When Attention Meets Fast Recurrence: Training La… | 17.10 | 2021-02-24 |
| SegaTransformer-XL | Segatron: Segment-Aware Transformer for Language … | 17.10 | 2020-04-30 |
| Transformer-XL (SGD dynamic eval) | Dynamic Evaluation of Transformer Language Models | 17.00 | 2019-04-17 |
| Hybrid H3 (355M) | Hungry Hungry Hippos: Towards Language Modeling w… | 16.90 | 2022-12-28 |
| ∞-former (initialized GPT-2 Small) | $\infty$-former: Infinite Memory Transformer | 16.64 | 2021-09-01 |
| ∞-former (Sticky memories + initialized GPT-2 Small) | $\infty$-former: Infinite Memory Transformer | 16.61 | 2021-09-01 |
| Transformer-XL (RMS dynamic eval) | Dynamic Evaluation of Transformer Language Models | 16.40 | 2019-04-17 |
| kNN-LM | Generalization through Memorization: Nearest Neig… | 16.12 | 2019-11-01 |
| Routing Transformer | Efficient Content-Based Sparse Attention with Rou… | 15.80 | 2020-03-12 |
| kNN-LM w/ Continuous Cache | Generalization through Memorization: Nearest Neig… | 15.79 | 2019-11-01 |
| kNN-LM w/ Adaptive Coefficient | You can't pick your neighbors, or can you? When a… | 15.50 | 2022-10-28 |
| GateLoop (125M) | GateLoop: Fully Data-Controlled Linear Recurrence… | 13.40 | 2023-11-03 |
| Ensemble of All | Advancing State of the Art in Language Modeling | 13.29 | 2023-11-28 |
| Hybrid H3 (1.3B) | Hungry Hungry Hippos: Towards Language Modeling w… | 12.50 | 2022-12-28 |
| GLM-XXLarge (unidirectional) | GLM: General Language Model Pretraining with Auto… | 12.22 | 2021-03-18 |
| GLM-XXLarge (bidirectional) | GLM: General Language Model Pretraining with Auto… | 11.33 | 2021-03-18 |
| Megatron-LM | Megatron-LM: Training Multi-Billion Parameter Lan… | 10.81 | 2019-09-17 |
| Hybrid H3 (2.7B) | Hungry Hungry Hippos: Towards Language Modeling w… | 10.60 | 2022-12-28 |
| RETRO (7.5B) | Improving language models by retrieving from tril… | 2.40 | 2021-12-08 |
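
The scores above appear to be test-set perplexities (lower is better). As a reference for how the metric is computed, here is a minimal sketch, assuming natural-log per-token probabilities from an autoregressive model; the helper name and example numbers are illustrative only and not tied to any row in the table:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    `token_log_probs` holds the natural-log probability the model assigns
    to each reference token. Note that the tokenization level (word vs.
    subword) affects the number actually reported by a given paper.
    """
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Example: a model that assigns probability 0.05 to every token of a
# 1,000-token test set has perplexity exp(-ln 0.05) = 20.0.
print(perplexity([math.log(0.05)] * 1000))  # -> 20.0
```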