arXiv-10

Name: arXiv-10
Published: 2022-06-25
License: Open Source

Dataset Information

Modalities: Texts
Languages: English
Introduced: 2022
License: Open Source
Homepage: Official Website

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

arXiv-10 is a benchmark dataset of the titles and abstracts of 100,000 arXiv scientific papers.
It contains 10 classes and is balanced, with exactly 10,000 papers per class.
The classes are subcategories of computer science, physics, and mathematics.
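
The balance claim above (10 classes, 10,000 papers each) can be checked once the labels are loaded. The following is a minimal sketch: the helper names `class_balance` and `is_balanced` are hypothetical, and the toy label list stands in for the real labels, whose file format is not described here.

```python
from collections import Counter


def class_balance(labels):
    """Return a mapping from class label to number of examples."""
    return Counter(labels)


def is_balanced(labels, num_classes, per_class):
    """Check that there are exactly `num_classes` classes of `per_class` examples each."""
    counts = class_balance(labels)
    return len(counts) == num_classes and all(n == per_class for n in counts.values())


# Toy stand-in for the real labels (assumption): 10 classes, 3 examples each.
toy_labels = [f"class_{i}" for i in range(10) for _ in range(3)]
print(is_balanced(toy_labels, num_classes=10, per_class=3))
```

For the full dataset, the same check would be called with `num_classes=10` and `per_class=10000`.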

• Direct link: Download

• Citation:

@inproceedings{farhangi2022protoformer,
  title={Protoformer: Embedding Prototypes for Transformers},
  author={Farhangi, Ashkan and Sui, Ning and Hua, Nan and Bai, Haiyan and Huang, Arthur and Guo, Zhishan},
  booktitle={Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16--19, 2022, Proceedings, Part I},
  pages={447--458},
  year={2022}
}

Variants: arXiv-10

Associated Benchmarks

This dataset is used in 1 benchmark:

Text Classification - Metrics: Accuracy
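
Accuracy here is the fraction of papers whose predicted class matches the true class. The sketch below is illustrative only; the function name `accuracy` and the example labels are assumptions, not part of any official evaluation script.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
true_labels = ["cs", "physics", "math", "cs"]
pred_labels = ["cs", "physics", "math", "math"]
print(accuracy(true_labels, pred_labels))
```

If the evaluation split is balanced like the full dataset (an assumption), always predicting one class would score 1/10 = 0.1, which gives a simple floor for reading reported accuracies.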

Recent Benchmark Submissions

Task	Model	Paper	Date
Text Classification	Protoformer	Protoformer: Embedding Prototypes for Transformers	2022-06-25
Text Classification	RoBERTa	RoBERTa: A Robustly Optimized BERT …	2019-07-26
Text Classification	DocBERT	DocBERT: BERT for Document Classification	2019-04-17

Research Papers

Recent papers with results on this dataset:

Protoformer: Embedding Prototypes for Transformers (2022)
RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019)
DocBERT: BERT for Document Classification (2019)

