Wiki-40B

Dataset Information
License: Unknown

Overview

A multilingual language modeling benchmark covering 40+ languages that span several scripts and linguistic families. It contains around 40 billion characters and is intended to accelerate research on multilingual modeling.

Source: Wiki-40B: Multilingual Language Model Dataset

Variants: Wiki-40B

Associated Benchmarks

This dataset is used in 3 benchmarks:

Recent Benchmark Submissions

Task | Model | Paper | Date
Benchmarking | OutEffHop-Bert_base | Outlier-Efficient Hopfield Layers for Large … | 2024-04-04
Quantization | OutEffHop-Bert_base | Outlier-Efficient Hopfield Layers for Large … | 2024-04-04
Language Modelling | FLASH-Quad-8k | Transformer Quality in Linear Time | 2022-02-21
Language Modelling | Combiner-Axial-8k | Combiner: Full Attention Transformer with … | 2021-07-12
Language Modelling | Combiner-Fixed-8k | Combiner: Full Attention Transformer with … | 2021-07-12

Research Papers

Recent papers with results on this dataset: