arXiv-10

Dataset Information
Modalities: Texts
Languages: English
Introduced: 2022
License: Open Source

Overview

A benchmark dataset containing the titles and abstracts of 100,000 arXiv scientific papers.
The dataset has 10 classes and is balanced, with exactly 10,000 papers per class.
The classes are subcategories of computer science, physics, and mathematics.
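
The balance claim above is easy to verify once the labels are loaded. The following is a minimal sketch, not part of the dataset's own tooling: the function name `check_balance` and the placeholder class names are hypothetical, and the synthetic label list only stands in for the real label column, whose file format and column names are not given here.

```python
from collections import Counter
from typing import Iterable


def check_balance(labels: Iterable[str], n_classes: int = 10, per_class: int = 10_000) -> bool:
    """Return True if `labels` holds exactly `n_classes` distinct classes,
    each occurring exactly `per_class` times."""
    counts = Counter(labels)
    return len(counts) == n_classes and all(c == per_class for c in counts.values())


# Synthetic stand-in for the real label column (assumption): ten placeholder
# class names, each repeated 10,000 times. The real category names differ.
placeholder_classes = [f"class_{i}" for i in range(10)]
synthetic_labels = [name for name in placeholder_classes for _ in range(10_000)]

balanced = check_balance(synthetic_labels)
total = len(synthetic_labels)
```

With the real data, the same function would be called on the label column after loading it; a `False` result would indicate a partial download or a mislabeled split.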


• Citation:

@inproceedings{farhangi2022protoformer,
  title={Protoformer: Embedding Prototypes for Transformers},
  author={Farhangi, Ashkan and Sui, Ning and Hua, Nan and Bai, Haiyan and Huang, Arthur and Guo, Zhishan},
  booktitle={Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16--19, 2022, Proceedings, Part I},
  pages={447--458},
  year={2022}
}

Variants: arXiv-10

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task                 Model        Paper                                               Date
Text Classification  Protoformer  Protoformer: Embedding Prototypes for Transformers  2022-06-25
Text Classification  RoBERTa      RoBERTa: A Robustly Optimized BERT …                2019-07-26
Text Classification  DocBERT      DocBERT: BERT for Document Classification           2019-04-17

Research Papers

Recent papers with results on this dataset: