RVL-CDIP

Document Image Classification Benchmark

📊 Showing 29 results | 📏 Metric: Accuracy

Top Performing Models

Rank	Model	Paper	Accuracy	Date	Code
1	RoBERTa base	RoBERTa: A Robustly Optimized BERT Pretraining Approach	90.06	2019-07-26	📦 huggingface/transformers 📦 pytorch/fairseq 📦 PaddlePaddle/PaddleNLP
2	EAML	EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification	-	2023-05-11	-
3	DocFormerBASE	DocFormer: End-to-End Transformer for Document Understanding	-	2021-06-22	📦 shabie/docformer
4	LayoutLMV3Large	LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking	-	2022-04-18	📦 huggingface/transformers 📦 microsoft/unilm 📦 pwc-1/Paper-9 📦 MindSpore-scientific-2/code-14
5	LiLT[EN-R]BASE	LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding	-	2022-02-28	📦 huggingface/transformers 📦 jpwang/lilt 📦 pwc-1/Paper-9 📦 MindSpore-scientific-2/code-14 📦 MS-P3/code3
6	LayoutLMv2LARGE	LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding	-	2020-12-29	📦 huggingface/transformers 📦 PaddlePaddle/PaddleOCR 📦 microsoft/unilm
7	TILT-Large	Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer	-	2021-02-18	📦 uakarsh/TiLT-Implementation
8	DocFormer large	DocFormer: End-to-End Transformer for Document Understanding	-	2021-06-22	📦 shabie/docformer
9	LayoutLMv3BASE	LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking	-	2022-04-18	📦 huggingface/transformers 📦 microsoft/unilm 📦 pwc-1/Paper-9 📦 MindSpore-scientific-2/code-14
10	Donut	OCR-free Document Understanding Transformer	-	2021-11-30	📦 clovaai/donut 📦 impira/docquery 📦 MindCode-4/code-3 📦 code-implementation1/Code9 📦 2023-MindSpore-1/ms-code-2

All Papers (29)

RoBERTa: A Robustly Optimized BERT Pretraining Approach

2019

RoBERTa base

huggingface/transformers pytorch/fairseq

EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification

2023

EAML

DocFormer: End-to-End Transformer for Document Understanding

2021

DocFormerBASE

shabie/docformer

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

2022

LayoutLMV3Large

huggingface/transformers microsoft/unilm

LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding

2022

LiLT[EN-R]BASE

huggingface/transformers jpwang/lilt

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

2020

LayoutLMv2LARGE

huggingface/transformers PaddlePaddle/PaddleOCR

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

2021

TILT-Large

uakarsh/TiLT-Implementation

DocFormer: End-to-End Transformer for Document Understanding

2021

DocFormer large

shabie/docformer

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

2022

LayoutLMv3BASE

huggingface/transformers microsoft/unilm

OCR-free Document Understanding Transformer

2021

Donut

clovaai/donut impira/docquery

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

2021

TILT-Base

uakarsh/TiLT-Implementation

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

2020

LayoutLMv2BASE

huggingface/transformers PaddlePaddle/PaddleOCR

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

2021

LayoutXLM

huggingface/transformers PaddlePaddle/PaddleOCR

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

2023

StrucTexTv2 (large)

PaddlePaddle/VIMER

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

2019

Pre-trained LayoutLM

huggingface/transformers PaddlePaddle/PaddleOCR

DoPTA: Improving Document Layout Analysis using Patch-Text Alignment

2024

DoPTA

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

2023

StrucTexTv2 (small)

PaddlePaddle/VIMER

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

2022

VLCDoC

GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

2023

TransferDoc

Multimodal Side-Tuning for Document Classification

2023

Multimodal (ResNet50)

thezingaro/multimodal-side-tuning

DiT: Self-supervised Pre-training for Document Image Transformer

2022

DiT-L

huggingface/transformers microsoft/unilm

Improving accuracy and speeding up Document Image Classification through parallel systems

2020

Pre-trained EfficientNet

javiferran/document-classification

Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks

2018

Transfer Learning from VGG16 trained on Imagenet

microsoft/unilm BordiaS/layoutlm

Multimodal Side-Tuning for Document Classification

2023

Multimodal (MobileNetV2)

thezingaro/multimodal-side-tuning

DiT: Self-supervised Pre-training for Document Image Transformer

2022

DiT-B

huggingface/transformers microsoft/unilm

BEiT: BERT Pre-Training of Image Transformers

2021

BEiT-B

huggingface/transformers rwightman/pytorch-image-models

Cutting the Error by Half: Investigation of Very Deep CNN and Advanced Training Strategies for Document Image Classification

2017

Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50

microsoft/unilm BordiaS/layoutlm

Analysis of Convolutional Neural Networks for Document Image Classification

2017

AlexNet + spatial pyramidal pooling + image resizing

Training data-efficient image transformers & distillation through attention

2020

DeiT-B

huggingface/transformers rwightman/pytorch-image-models


Model	Paper	Accuracy	Date
RoBERTa base	RoBERTa: A Robustly Optimized BERT Pretraining Ap…	90.06	2019-07-26
EAML	EAML: Ensemble Self-Attention-based Mutual Learni…	-	2023-05-11
DocFormerBASE	DocFormer: End-to-End Transformer for Document Un…	-	2021-06-22
LayoutLMV3Large	LayoutLMv3: Pre-training for Document AI with Uni…	-	2022-04-18
LiLT[EN-R]BASE	LiLT: A Simple yet Effective Language-Independent…	-	2022-02-28
LayoutLMv2LARGE	LayoutLMv2: Multi-modal Pre-training for Visually…	-	2020-12-29
TILT-Large	Going Full-TILT Boogie on Document Understanding …	-	2021-02-18
DocFormer large	DocFormer: End-to-End Transformer for Document Un…	-	2021-06-22
LayoutLMv3BASE	LayoutLMv3: Pre-training for Document AI with Uni…	-	2022-04-18
Donut	OCR-free Document Understanding Transformer	-	2021-11-30
TILT-Base	Going Full-TILT Boogie on Document Understanding …	-	2021-02-18
LayoutLMv2BASE	LayoutLMv2: Multi-modal Pre-training for Visually…	-	2020-12-29
LayoutXLM	LayoutXLM: Multimodal Pre-training for Multilingu…	-	2021-04-18
StrucTexTv2 (large)	StrucTexTv2: Masked Visual-Textual Prediction for…	-	2023-03-01
Pre-trained LayoutLM	LayoutLM: Pre-training of Text and Layout for Doc…	-	2019-12-31
DoPTA	DoPTA: Improving Document Layout Analysis using P…	-	2024-12-17
StrucTexTv2 (small)	StrucTexTv2: Masked Visual-Textual Prediction for…	-	2023-03-01
VLCDoC	VLCDoC: Vision-Language Contrastive Pre-Training …	-	2022-05-24
TransferDoc	GlobalDoc: A Cross-Modal Vision-Language Framewor…	-	2023-09-11
Multimodal (ResNet50)	Multimodal Side-Tuning for Document Classification	-	2023-01-16
DiT-L	DiT: Self-supervised Pre-training for Document Im…	-	2022-03-04
Pre-trained EfficientNet	Improving accuracy and speeding up Document Image…	-	2020-06-16
Transfer Learning from VGG16 trained on Imagenet	Document Image Classification with Intra-Domain T…	-	2018-01-29
Multimodal (MobileNetV2)	Multimodal Side-Tuning for Document Classification	-	2023-01-16
DiT-B	DiT: Self-supervised Pre-training for Document Im…	-	2022-03-04
BEiT-B	BEiT: BERT Pre-Training of Image Transformers	-	2021-06-15
Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50	Cutting the Error by Half: Investigation of Very …	-	2017-04-11
AlexNet + spatial pyramidal pooling + image resizing	Analysis of Convolutional Neural Networks for Doc…	-	2017-08-10
DeiT-B	Training data-efficient image transformers & dist…	-	2020-12-23