MTEB

Massive Text Embedding Benchmark

Dataset Information
Modalities: Texts
Introduced: 2022

Overview

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 task types are Bitext Mining, Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity (STS), and Summarization. The datasets vary in text length and are grouped into three categories: sentence to sentence (S2S), paragraph to paragraph (P2P), and sentence to paragraph (S2P).
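To make the task types more concrete, here is a minimal sketch of how a Semantic Textual Similarity task can be scored: each sentence pair is embedded, the two embeddings are compared with cosine similarity, and the predicted similarities are rank-correlated (Spearman) with human-annotated gold scores. This is an illustration in plain Python, not the benchmark's own implementation. The function names, the toy two-dimensional "embeddings", and the gold scores are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def sts_score(embedding_pairs, gold_scores):
    """Score an STS-style task from embedded sentence pairs and gold labels."""
    predicted = [cosine(u, v) for u, v in embedding_pairs]
    return spearman(predicted, gold_scores)


if __name__ == "__main__":
    # Hypothetical toy data: three sentence pairs already embedded in 2-D.
    pairs = [
        ([1.0, 0.0], [1.0, 0.1]),  # near-duplicate sentences
        ([1.0, 0.0], [0.5, 0.5]),  # loosely related sentences
        ([1.0, 0.0], [0.0, 1.0]),  # unrelated sentences
    ]
    gold = [5.0, 3.0, 0.0]  # hypothetical human similarity ratings
    print(sts_score(pairs, gold))
```

In this toy case the predicted similarities order the pairs exactly as the gold ratings do, so the correlation is at its maximum. A real evaluation would replace the hand-written vectors with the output of an embedding model.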

The latest leaderboards are hosted on Hugging Face.

Variants: MTEB

Associated Benchmarks

This dataset is used in 5 benchmarks.

Recent Benchmark Submissions

| Task | Model | Paper | Date |
|---|---|---|---|
| Semantic Textual Similarity | AnglE-UAE | AnglE-optimized Text Embeddings | 2023-09-22 |
| Semantic Textual Similarity | ST5-XL | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | MPNet-multilingual | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | ST5-Large | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | SimCSE-BERT-sup | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | ST5-Base | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | ST5-XXL | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | SGPT-5.8B-nli | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | MPNet | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | MiniLM-L12 | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | GTR-XXL | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | MiniLM-L6 | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | GTR-Large | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | SGPT-5.8B-msmarco | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | GTR-XL | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | SGPT-BLOOM-7.1B-msmarco | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | GTR-Base | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | SGPT-2.7B-msmarco | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | coCondenser-msmarco | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
| Semantic Textual Similarity | Ada Similarity | MTEB: Massive Text Embedding Benchmark | 2022-10-13 |
