OpenWebText

Dataset Information
Modalities: Texts
Languages: English

Overview

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes, 38GB in total.

Source: RoBERTa: A Robustly Optimized BERT Pretraining Approach

Variants: OpenWebText
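
The selection rule in the overview (keep URLs shared on Reddit with at least three upvotes) can be illustrated with a minimal sketch. This is not the actual OpenWebText pipeline: the record layout (`url` and `score` keys), the function name `select_urls`, and the URL deduplication step are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Threshold taken from the overview: at least three upvotes.
MIN_UPVOTES = 3


def select_urls(submissions: Iterable[Dict], min_upvotes: int = MIN_UPVOTES) -> List[str]:
    """Return unique URLs whose submission score meets the threshold.

    The record layout ({"url": ..., "score": ...}) is hypothetical, and
    deduplicating repeated URLs is an assumption, not a documented step.
    """
    seen = set()
    selected = []
    for sub in submissions:
        url = sub["url"]
        if sub["score"] >= min_upvotes and url not in seen:
            seen.add(url)
            selected.append(url)  # keep first-seen order
    return selected


# Hypothetical toy records, not real Reddit data.
submissions = [
    {"url": "https://example.com/a", "score": 5},
    {"url": "https://example.com/b", "score": 2},  # below threshold, dropped
    {"url": "https://example.com/a", "score": 7},  # duplicate URL, dropped
    {"url": "https://example.com/c", "score": 3},  # exactly at threshold, kept
]

print(select_urls(submissions))
```

In this sketch the threshold is inclusive, matching the "at least three" wording; the text of each selected URL would then be extracted in a separate step that is not shown here.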

Associated Benchmarks

This dataset is used in 2 benchmarks:

Recent Benchmark Submissions

Task | Model | Paper | Date
Language Modelling | MDLM-Prime | Beyond Masked and Unmasked: Discrete … | 2025-05-24
Language Modelling | BD3-LMs | Block Diffusion: Interpolating Between Autoregressive … | 2025-03-12
Text Generation | GPT2-Hermite | Polynomial, trigonometric, and tropical activations | 2025-02-03
Language Modelling | GPT2-Hermite | Polynomial, trigonometric, and tropical activations | 2025-02-03
Language Modelling | GPT2-Tropical | Polynomial, trigonometric, and tropical activations | 2025-02-03
Language Modelling | GPT2-Fourier | Polynomial, trigonometric, and tropical activations | 2025-02-03
Language Modelling | GPT2-GELU | Polynomial, trigonometric, and tropical activations | 2025-02-03
Language Modelling | EDLM-NCE | Energy-Based Diffusion Language Models for … | 2024-10-28
Language Modelling | EDLM-coAR | Energy-Based Diffusion Language Models for … | 2024-10-28
Text Generation | GPT2-81M-LOOP | Loop Neural Networks for Parameter … | 2024-09-21
Language Modelling | ARM | Simple and Effective Masked Diffusion … | 2024-06-11
Language Modelling | MDLM | Simple and Effective Masked Diffusion … | 2024-06-11
Language Modelling | GenMD4 | Simplified and Generalized Masked Diffusion … | 2024-06-06
Language Modelling | SEDD | Discrete Diffusion Modeling by Estimating … | 2023-10-25

Research Papers

Recent papers with results on this dataset: