| Model | Paper | Accuracy (%) | Date |
|---|---|---:|---|
| UnitedSynT5 (3B) | First Train to Generate, then Generate to Train: … | 92.60 | 2024-12-12 |
| T5 | SMART: Robust and Efficient Fine-Tuning for Pre-t… | 92.00 | 2019-11-08 |
| T5-XXL 11B (fine-tuned) | Exploring the Limits of Transfer Learning with a … | 92.00 | 2019-10-23 |
| T5-11B | Exploring the Limits of Transfer Learning with a … | 91.70 | 2019-10-23 |
| T5-3B | Exploring the Limits of Transfer Learning with a … | 91.40 | 2019-10-23 |
| ALBERT | ALBERT: A Lite BERT for Self-supervised Learning … | 91.30 | 2019-09-26 |
| DeBERTa (large) | DeBERTa: Decoding-enhanced BERT with Disentangled… | 91.10 | 2020-06-05 |
| Adv-RoBERTa ensemble | StructBERT: Incorporating Language Structures int… | 91.10 | 2019-08-13 |
| SMARTRoBERTa | SMART: Robust and Efficient Fine-Tuning for Pre-t… | 91.10 | 2019-11-08 |
| RoBERTa | RoBERTa: A Robustly Optimized BERT Pretraining Ap… | 90.80 | 2019-07-26 |
| XLNet (single model) | XLNet: Generalized Autoregressive Pretraining for… | 90.80 | 2019-06-19 |
| RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned) | LLM.int8(): 8-bit Matrix Multiplication for Trans… | 90.20 | 2022-08-15 |
| RoBERTa (ensemble) | RoBERTa: A Robustly Optimized BERT Pretraining Ap… | 90.20 | 2019-07-26 |
| T5-Large | Exploring the Limits of Transfer Learning with a … | 89.90 | 2019-10-23 |
| PSQ (Chen et al., 2020) | A Statistical Framework for Low-bitwidth Training… | 89.90 | 2020-10-27 |
| UnitedSynT5 (335M) | First Train to Generate, then Generate to Train: … | 89.80 | 2024-12-12 |
| T5-Large 770M | Exploring the Limits of Transfer Learning with a … | 89.60 | 2019-10-23 |
| ERNIE 2.0 Large | ERNIE 2.0: A Continual Pre-training Framework for… | 88.70 | 2019-07-29 |
| SpanBERT | SpanBERT: Improving Pre-training by Representing … | 88.10 | 2019-07-24 |
| BERT-Large | FNet: Mixing Tokens with Fourier Transforms | 88.00 | 2021-05-09 |
| ASA + RoBERTa | Adversarial Self-Attention for Language Understan… | 88.00 | 2022-06-25 |
| MT-DNN-ensemble | Improving Multi-Task Deep Neural Networks via Kno… | 87.90 | 2019-04-20 |
| Q-BERT (Shen et al., 2020) | Q-BERT: Hessian Based Ultra Low Precision Quantiz… | 87.80 | 2019-09-12 |
| Snorkel MeTaL (ensemble) | Training Complex Models with Multi-Task Weak Supe… | 87.60 | 2018-10-05 |
| BigBird | Big Bird: Transformers for Longer Sequences | 87.50 | 2020-07-28 |
| T5-Base | Exploring the Limits of Transfer Learning with a … | 87.10 | 2019-10-23 |
| MT-DNN | Multi-Task Deep Neural Networks for Natural Langu… | 86.70 | 2019-01-31 |
| BERT-Large | BERT: Pre-training of Deep Bidirectional Transfor… | 86.70 | 2018-10-11 |
| RealFormer | RealFormer: Transformer Likes Residual Attention | 86.28 | 2020-12-21 |
| gMLP-large | Pay Attention to MLPs | 86.20 | 2021-05-17 |
| ERNIE 2.0 Base | ERNIE 2.0: A Continual Pre-training Framework for… | 86.10 | 2019-07-29 |
| MT-DNN-SMARTv0 | SMART: Robust and Efficient Fine-Tuning for Pre-t… | 85.70 | 2019-11-08 |
| MT-DNN-SMART | SMART: Robust and Efficient Fine-Tuning for Pre-t… | 85.70 | 2019-11-08 |
| Q8BERT (Zafrir et al., 2019) | Q8BERT: Quantized 8Bit BERT | 85.60 | 2019-10-14 |
| SMART+BERT-BASE | SMART: Robust and Efficient Fine-Tuning for Pre-t… | 85.60 | 2019-11-08 |
| SMART-BERT | SMART: Robust and Efficient Fine-Tuning for Pre-t… | 85.60 | 2019-11-08 |
| ASA + BERT-base | Adversarial Self-Attention for Language Understan… | 85.00 | 2022-06-25 |
| TinyBERT-6 67M | TinyBERT: Distilling BERT for Natural Language Un… | 84.60 | 2019-09-23 |
| ELC-BERT-base 98M (zero init) | Not all layers are equally as important: Every La… | 84.40 | 2023-11-03 |
| 24hBERT | How to Train BERT with an Academic Budget | 84.40 | 2021-04-15 |
| ERNIE | ERNIE: Enhanced Language Representation with Info… | 84.00 | 2019-05-17 |
| Charformer-Tall | Charformer: Fast Character Transformers via Gradi… | 83.70 | 2021-06-23 |
| LTG-BERT-base 98M | Not all layers are equally as important: Every La… | 83.00 | 2023-11-03 |
| TinyBERT-4 14.5M | TinyBERT: Distilling BERT for Natural Language Un… | 82.50 | 2019-09-23 |
| T5-Small | Exploring the Limits of Transfer Learning with a … | 82.40 | 2019-10-23 |
| SqueezeBERT | SqueezeBERT: What can computer vision teach NLP a… | 82.00 | 2020-06-19 |
| GPST (unsupervised generative syntactic LM) | Generative Pretrained Structured Transformers: Un… | 81.80 | 2024-03-13 |
| ELC-BERT-small 24M | Not all layers are equally as important: Every La… | 79.20 | 2023-11-03 |
| LTG-BERT-small 24M | Not all layers are equally as important: Every La… | 78.00 | 2023-11-03 |
| FNet-Large | FNet: Mixing Tokens with Fourier Transforms | 78.00 | 2021-05-09 |
| aESIM | Attention Boosted Sequential Inference Model | 73.90 | 2018-12-05 |
| T5-Large 738M | LaMini-LM: A Diverse Herd of Distilled Models fro… | 72.40 | 2023-04-27 |
| Multi-task BiLSTM + Attn | GLUE: A Multi-Task Benchmark and Analysis Platfor… | 72.20 | 2018-04-20 |
| Stacked Bi-LSTMs (shortcut connections, max-pooling) | Combining Similarity Features and Deep Representa… | 71.40 | 2018-11-02 |
| GenSen | Learning General Purpose Distributed Sentence Rep… | 71.40 | 2018-03-30 |
| Bi-LSTM sentence encoder (max-pooling) | Combining Similarity Features and Deep Representa… | 70.70 | 2018-11-02 |
| Stacked Bi-LSTMs (shortcut connections, max-pooling, attention) | Combining Similarity Features and Deep Representa… | 70.70 | 2018-11-02 |
| LM-CPPF RoBERTa-base | LM-CPPF: Paraphrasing-Guided Data Augmentation fo… | 68.40 | 2023-05-29 |
| SWEM-max | Baseline Needs More Love: On Simple Word-Embeddin… | 68.20 | 2018-05-24 |
| LaMini-GPT 1.5B | LaMini-LM: A Diverse Herd of Distilled Models fro… | 67.50 | 2023-04-27 |
| LaMini-F-T5 783M | LaMini-LM: A Diverse Herd of Distilled Models fro… | 61.40 | 2023-04-27 |
| LaMini-T5 738M | LaMini-LM: A Diverse Herd of Distilled Models fro… | 54.70 | 2023-04-27 |
| GPT-2-XL 1.5B | LaMini-LM: A Diverse Herd of Distilled Models fro… | 36.50 | 2023-04-27 |