BBH

BIG-Bench Hard

Dataset Information
Introduced
2022
License
Unknown
Homepage

Overview

BIG-Bench Hard (BBH) is a subset of BIG-Bench, a diverse evaluation suite for language models. BBH comprises 23 challenging BIG-Bench tasks on which, at the time of its release, no prior language model evaluation had outperformed the average human rater.

The BBH tasks require multi-step reasoning, and few-shot prompting without chain-of-thought (CoT), as used in the original BIG-Bench evaluations, was found to substantially underestimate the best performance of language models. With CoT prompting, PaLM surpasses average human-rater performance on 10 of the 23 tasks, and Codex surpasses it on 17 of the 23 tasks.
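The sketch below illustrates the difference between answer-only and CoT prompting on a BBH task. It is a minimal, hedged example: the dataset identifier ("lukaemon/bbh"), the task/config name, and the field names ("input", "target") are assumptions about one community mirror on the Hugging Face Hub, and generate_answer is a placeholder for whatever model API is being evaluated.

```python
# Minimal sketch: scoring a model on one BBH task with exact-match accuracy,
# comparing answer-only prompting to chain-of-thought (CoT) prompting.
from datasets import load_dataset

# Assumed dataset mirror and task name; adjust to the actual source used.
task = load_dataset("lukaemon/bbh", "date_understanding", split="test")

def direct_prompt(question: str) -> str:
    # Answer-only few-shot style, as in the original BIG-Bench evaluations.
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    # CoT prompting asks the model to reason step by step before answering.
    return f"Q: {question}\nA: Let's think step by step."

def generate_answer(prompt: str) -> str:
    # Placeholder: call the model under evaluation here.
    raise NotImplementedError

correct = 0
for example in task:
    prediction = generate_answer(cot_prompt(example["input"]))
    correct += example["target"].strip() in prediction
print(f"CoT accuracy: {correct / len(task):.3f}")
```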

Variants: BBH-nlp, BBH-alg, Big-bench Hard, BBH

Associated Benchmarks

This dataset is used in 1 benchmark.

Recent Benchmark Submissions

Task | Model | Paper | Date
Question Answering | Shakti-LLM (2.5B) | SHAKTI: A 2.5 Billion Parameter … | 2024-10-15

Research Papers

Recent papers with results on this dataset: