BBH

BIG-Bench Hard

Dataset Information
Introduced
2022
License
Unknown
Homepage

Overview

BIG-Bench Hard (BBH) is a subset of BIG-Bench, a diverse evaluation suite for language models. BBH comprises 23 challenging BIG-Bench tasks on which, at the time of its release, no prior language model evaluation had outperformed the average human rater.

The BBH tasks require multi-step reasoning, and few-shot prompting without chain-of-thought (CoT), as used in the original BIG-Bench evaluations, was found to substantially underestimate the best performance of language models. With CoT prompting, PaLM surpasses average human-rater performance on 10 of the 23 tasks, and Codex surpasses it on 17 of the 23 tasks.
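The sketch below illustrates the difference between answer-only and CoT prompting on a BBH task. It is a minimal, hedged example: the dataset identifier ("lukaemon/bbh"), the task/config name, and the field names ("input", "target") are assumptions about one community mirror on the Hugging Face Hub, and generate_answer is a placeholder for whatever model API is being evaluated.

```python
# Minimal sketch: scoring a model on one BBH task with exact-match accuracy,
# comparing answer-only prompting to chain-of-thought (CoT) prompting.
from datasets import load_dataset

# Assumed dataset mirror and task name; adjust to the actual source used.
task = load_dataset("lukaemon/bbh", "date_understanding", split="test")

def direct_prompt(question: str) -> str:
    # Answer-only few-shot style, as in the original BIG-Bench evaluations.
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    # CoT prompting asks the model to reason step by step before answering.
    return f"Q: {question}\nA: Let's think step by step."

def generate_answer(prompt: str) -> str:
    # Placeholder: call the model under evaluation here.
    raise NotImplementedError

correct = 0
for example in task:
    prediction = generate_answer(cot_prompt(example["input"]))
    correct += example["target"].strip() in prediction
print(f"CoT accuracy: {correct / len(task):.3f}")
```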

Variants: BBH-nlp, BBH-alg, Big-bench Hard, BBH

Associated Benchmarks

This dataset is used in 1 benchmark.

Recent Benchmark Submissions

Task | Model | Paper | Date
Question Answering | Shakti-LLM (2.5B) | SHAKTI: A 2.5 Billion Parameter … | 2024-10-15

Research Papers

Recent papers with results on this dataset: