HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, even though its questions are trivial for humans (>95% accuracy).
Variants: HellaSwag, HellaSwag (10-Shot), HellaSwag TR
This dataset is used in 4 benchmarks:
Recent papers with results on this dataset: