
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

The Falcon LLM team: Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay (2023)

Paper Information

  • arXiv ID: 2306.01116
  • Venue: arXiv.org
  • Domain: Natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 8/10

Abstract

Large language models are commonly trained on a mixture of filtered web data and curated "high-quality" corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release a 600 billion token extract of our REFINEDWEB dataset, along with 1.3B and 7.5B parameter language models trained on it.

[Figure: aggregated zero-shot performance (main-agg, %) as a function of compute (PF-days).]

Summary

The paper presents the construction and evaluation of the REFINEDWEB dataset for training large language models (LLMs) such as Falcon. It challenges the conventional practice of mixing curated and web data by demonstrating that properly filtered and deduplicated web data alone can yield powerful models that outperform those trained on curated datasets such as The Pile. The authors detail the creation of the five-trillion-token dataset through the MacroData Refinement (MDR) pipeline, which applies stringent filtering and deduplication to raw CommonCrawl data. Key findings include significant improvements in zero-shot performance over models trained on curated datasets, challenging prevailing assumptions about the necessity of curated corpora for effective language modeling. A 600-billion-token extract of the dataset and the trained models are publicly released to support future research in natural language processing.
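
The backbone of this pipeline is aggressive deduplication: the paper combines fuzzy MinHash deduplication with exact substring removal. The self-contained sketch below illustrates only the MinHash idea; the shingle size, number of hash functions, similarity threshold, and the O(n²) pairwise comparison are illustrative assumptions standing in for the bucketed, large-scale implementation described in the paper.

```python
import hashlib

def shingles(text, n=5):
    """Split a document into word n-grams (shingles); n=5 is an illustrative choice."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(shingle_set, num_hashes=128):
    """Represent a shingle set by its minimum hash value under num_hashes seeded hashes."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity of the two documents."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs, threshold=0.8):
    """Keep the first document of every near-duplicate cluster (toy O(n^2) version)."""
    sigs = [minhash_signature(shingles(d)) for d in docs]
    kept = []
    for i in range(len(docs)):
        if all(estimated_jaccard(sigs[i], sigs[j]) < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]

if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog near the old river bank",
        "the quick brown fox jumps over the lazy dog near the old river bank today",
        "an entirely different document about pretraining data for large language models",
    ]
    print(deduplicate(corpus))  # the second, near-duplicate document is dropped
```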

Methods

This paper employs the following methods:

  • MacroData Refinement (MDR)
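
Before deduplication, MDR applies URL filtering, text extraction, language identification, and document-wise and line-wise heuristic quality rules. The sketch below shows what a document-level heuristic filter of this kind can look like; the thresholds and rules are assumptions chosen for illustration, not the paper's exact configuration.

```python
import re

def passes_document_filters(text,
                            min_words=50, max_words=100_000,
                            min_mean_word_len=3, max_mean_word_len=10,
                            max_symbol_ratio=0.1, min_stop_words=2):
    """Toy document-wise quality filter in the spirit of MDR's heuristic rules.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False

    # Reject documents whose average word length looks like boilerplate or noise.
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (min_mean_word_len <= mean_word_len <= max_mean_word_len):
        return False

    # Reject documents dominated by symbols such as "#" or ellipses.
    symbols = len(re.findall(r"[#…]|\.\.\.", text))
    if symbols / len(words) > max_symbol_ratio:
        return False

    # Require a handful of common stop words as a crude fluency/language proxy.
    stop_words = {"the", "be", "to", "of", "and", "that", "have", "with"}
    if sum(w.lower() in stop_words for w in words) < min_stop_words:
        return False

    return True

samples = [
    "Short spam ### ### ###",
    "The model is trained on text from the web and the data of the crawl. " * 10,
]
print([passes_document_filters(doc) for doc in samples])  # [False, True]
```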

Models Used

  • Falcon
  • Falcon-RW

Datasets

The following datasets were used in this research:

  • REFINEDWEB
  • CommonCrawl
  • The Pile
  • C4
  • OSCAR
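
For reference, the publicly released 600-billion-token REFINEDWEB extract can be streamed rather than downloaded in full. In the sketch below, the Hugging Face Hub identifier tiiuae/falcon-refinedweb and the content column name are assumptions about the public release, not details taken from this page.

```python
# Minimal sketch: stream a few records from the released RefinedWeb extract.
# Assumed Hub id: "tiiuae/falcon-refinedweb"; assumed text column: "content".
from datasets import load_dataset

stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, record in enumerate(stream):
    if i == 0:
        print("fields:", list(record.keys()))          # inspect the actual schema
    print(record.get("content", "")[:200].replace("\n", " "))
    if i >= 2:
        break
```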

Evaluation Metrics

  • Zero-shot performance
  • Accuracy
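
"Aggregated zero-shot performance" here means macro-averaging accuracy over suites of standard benchmarks, where each task is scored by letting the model rank candidate answers by likelihood. The sketch below shows one common way to implement such zero-shot multiple-choice scoring with a causal LM; the checkpoint name and the length normalization are assumptions for illustration, not the paper's exact evaluation setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; older transformers versions may need trust_remote_code=True.
model_name = "tiiuae/falcon-rw-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_logprob(prompt, answer):
    """Length-normalized log-likelihood the model assigns to `answer` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    token_ids = full_ids[0, 1:]
    answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    scores = [log_probs[pos, token_ids[pos]] for pos in answer_positions]
    return float(sum(scores) / len(scores))

def zero_shot_choice(prompt, choices):
    """Pick the candidate continuation with the highest normalized log-likelihood."""
    return max(choices, key=lambda c: answer_logprob(prompt, " " + c))

print(zero_shot_choice("The capital of France is", ["Paris", "Berlin", "Rome"]))
```

Per-task accuracy is then the fraction of correct choices, and the aggregate score is the mean accuracy across tasks.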

Results

  • Models trained on REFINEDWEB outperform models trained on curated corpora such as The Pile on aggregated zero-shot benchmarks.

Limitations

The authors identified the following limitations:

  • Potential biases in the dataset.
  • Toxic or harmful content may persist at levels similar to those found in curated corpora.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

web data, pretraining dataset, deduplication, filtering, large language models

External Resources