
HellaSwag

Sentence Completion Benchmark

Performance Over Time

(Interactive chart omitted. 86 results; metric: Accuracy.)

Top Performing Models

| Rank | Model | Paper | Accuracy | Date | Code |
|------|-------|-------|----------|------|------|
| 1 | CompassMTL 567M with Tailor | Task Compass: Scaling Multi-task Pre-training with Task Prefix | 96.10 | 2022-10-12 | 📦 cooelf/compassmtl |
| 2 | CompassMTL 567M | Task Compass: Scaling Multi-task Pre-training with Task Prefix | 95.60 | 2022-10-12 | 📦 cooelf/compassmtl |
| 3 | DeBERTa-Large 304M (classification-based) | Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering | 95.60 | 2022-10-29 | 📦 declare-lab/team |
| 4 | GPT-4 (10-shot) | GPT-4 Technical Report | 95.30 | 2023-03-15 | 📦 openai/evals · 📦 shmsw25/factscore · 📦 unispac/visual-adversarial-examples-jailbreak-large-language-models |
| 5 | LLaMA3+MoSLoRA | Mixture-of-Subspaces in Low-Rank Adaptation | 95.00 | 2024-06-16 | 📦 wutaiqiang/moslora |
| 6 | DeBERTa-Large 304M | Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering | 94.70 | 2022-10-29 | 📦 declare-lab/team |
| 7 | LLaMA-2 13B + MixLoRA | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 94.70 | 2024-04-22 | 📦 TUDB-Labs/MixLoRA · 📦 mikecovlee/mLoRA |
| 8 | Unicorn 11B (fine-tuned) | UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark | 93.90 | 2021-03-24 | 📦 allenai/rainbow |
| 9 | LLaMA-3 8B + MixLoRA | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 93.30 | 2024-04-22 | 📦 TUDB-Labs/MixLoRA · 📦 mikecovlee/mLoRA |
| 10 | LLaMA-2 7B + MixLoRA | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 93.10 | 2024-04-22 | 📦 TUDB-Labs/MixLoRA · 📦 mikecovlee/mLoRA |
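HellaSwag is a four-way multiple-choice task: a model scores each candidate ending for a context and picks one, and the Accuracy column above is the fraction of examples where the chosen index matches the gold label. A minimal sketch of that metric, using made-up stand-in rows (the `ctx`/`label` field names follow the commonly distributed dataset layout, but verify against the actual release):

```python
def accuracy(predictions, labels):
    """Fraction of examples where the predicted ending index matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

# Toy stand-ins for dataset rows: a context and a gold ending index in [0, 3].
examples = [
    {"ctx": "A man is sitting on a roof. He", "label": 3},
    {"ctx": "A woman pours batter into a pan. She", "label": 0},
]

# A real evaluator would score all four endings per example (e.g. by
# length-normalized log-likelihood) and take the argmax; here the
# predicted indices are simply hard-coded for illustration.
predictions = [3, 1]
gold = [ex["label"] for ex in examples]
print(f"accuracy = {accuracy(predictions, gold):.2f}")  # one of two correct -> 0.50
```

Leaderboard entries differ mainly in how the per-ending score is produced (zero-shot likelihood, few-shot prompting, or fine-tuned classification), not in this metric.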

All Papers (86)

- DiscoSense: Commonsense Reasoning with Discourse Connectives (2022) - ELECTRA-Large 335M (fine-tuned on DiscoSense and HellaSwag)
- Stay on topic with Classifier-Free Guidance (2023) - LLaMA 65B + CFG (0-shot)
- Stay on topic with Classifier-Free Guidance (2023) - LLaMA 30B + CFG (0-shot)
- Stay on topic with Classifier-Free Guidance (2023) - LLaMA 13B + CFG (0-shot)
- Language Models are Few-Shot Learners (2020) - GPT-3 175B (few-shot, k=32)
- Efficient Language Modeling with Sparse all-MLP (2022) - sMLP – deterministic 9.4B (0-shot)
- Language Models are Few-Shot Learners (2020) - GPT-3 Large 760M (0-shot)