C4

Colossal Clean Crawled Corpus

Dataset Information
Modalities
Texts
Languages
English
Introduced
2019
License
Unknown
Homepage

Overview

C4 (Colossal Clean Crawled Corpus) is a colossal, cleaned version of Common Crawl's web crawl corpus. It is derived from the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models.

The dataset can be downloaded in pre-processed form from AllenNLP.

Variants: C4, c4 en
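
The English variant is also commonly loaded through the Hugging Face datasets library. The sketch below is an illustrative assumption, not part of this page: it assumes the hosted copy under the repository id allenai/c4 with the en configuration, and uses streaming to avoid downloading the full corpus (hundreds of gigabytes).

```python
# Minimal sketch: streaming the English C4 variant via Hugging Face datasets.
# Assumes the "allenai/c4" repository and "en" configuration are available;
# this page itself only mentions the AllenNLP download.
from datasets import load_dataset

# Streaming yields records lazily instead of downloading the whole corpus.
c4_en = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Each record carries "text", "url", and "timestamp" fields.
for i, example in enumerate(c4_en):
    print(example["url"])
    print(example["text"][:200])
    if i >= 2:
        break
```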

Associated Benchmarks

This dataset is used in 1 benchmark: Language Modelling.

Recent Benchmark Submissions

Task | Model | Paper | Date
Language Modelling | LLM.float32 1.3B | LLM.int8(): 8-bit Matrix Multiplication for … | 2022-08-15
Language Modelling | Zeropoint LLM.int8 13B (vector-wise + decomp) | LLM.int8(): 8-bit Matrix Multiplication for … | 2022-08-15
Language Modelling | LLM.float32 6.7B | LLM.int8(): 8-bit Matrix Multiplication for … | 2022-08-15
Language Modelling | LLM.float32 2.7B | LLM.int8(): 8-bit Matrix Multiplication for … | 2022-08-15
Language Modelling | N-Grammer 288M | N-Grammer: Augmenting Transformers with latent … | 2022-07-13
Language Modelling | N-Grammer 343M | N-Grammer: Augmenting Transformers with latent … | 2022-07-13
Language Modelling | Primer | Primer: Searching for Efficient Transformers … | 2021-09-17
Language Modelling | T5++ | Primer: Searching for Efficient Transformers … | 2021-09-17
Language Modelling | Original T5 | Primer: Searching for Efficient Transformers … | 2021-09-17

Research Papers

Recent papers with results on this dataset: