Colossal Clean Crawled Corpus
C4 is a colossal, cleaned version of the Common Crawl web crawl corpus (https://commoncrawl.org). It was used to train the T5 text-to-text Transformer models.
The dataset can be downloaded in pre-processed form from AllenNLP.
Variants: C4, c4 en
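The "clean" in C4 refers to heuristic filtering applied to the raw Common Crawl text, as described in the T5 paper: keeping only lines that end in terminal punctuation, dropping very short lines and pages, and discarding pages with boilerplate such as "lorem ipsum". A minimal sketch of simplified versions of those rules (the function name and thresholds here are illustrative, not the reference implementation):

```python
def clean_page(text: str, min_words: int = 5, min_lines: int = 3):
    """Apply simplified C4-style heuristics to one page of crawled text.

    Keeps lines that end in terminal punctuation and contain at least
    `min_words` words; drops the whole page if it contains placeholder
    text or fewer than `min_lines` lines survive filtering.
    """
    terminal = (".", "!", "?", '"')
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(terminal):
            continue  # drop lines without terminal punctuation
        if len(line.split()) < min_words:
            continue  # drop very short lines
        if "lorem ipsum" in line.lower():
            return None  # drop pages containing placeholder text
        kept.append(line)
    if len(kept) < min_lines:
        return None  # drop pages that are too short after filtering
    return "\n".join(kept)
```

The real pipeline also deduplicates three-sentence spans across the corpus and filters by language and a bad-words list, which this sketch omits.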
This dataset is used in 1 benchmark:
| Task | Model | Paper | Date |
|---|---|---|---|
| Language Modelling | LLM.float32 1.3B | LLM.int8(): 8-bit Matrix Multiplication for … | 2022-08-15 |
| Language Modelling | Zeropoint LLM.int8 13B (vector-wise + decomp) | LLM.int8(): 8-bit Matrix Multiplication for … | 2022-08-15 |
| Language Modelling | LLM.float32 6.7B | LLM.int8(): 8-bit Matrix Multiplication for … | 2022-08-15 |
| Language Modelling | LLM.float32 2.7B | LLM.int8(): 8-bit Matrix Multiplication for … | 2022-08-15 |
| Language Modelling | N-Grammer 288M | N-Grammer: Augmenting Transformers with latent … | 2022-07-13 |
| Language Modelling | N-Grammer 343M | N-Grammer: Augmenting Transformers with latent … | 2022-07-13 |
| Language Modelling | Primer | Primer: Searching for Efficient Transformers … | 2021-09-17 |
| Language Modelling | T5++ | Primer: Searching for Efficient Transformers … | 2021-09-17 |
| Language Modelling | Original T5 | Primer: Searching for Efficient Transformers … | 2021-09-17 |
Recent papers with results on this dataset: