PubTabNet

Dataset Information
Introduced: 2019
License: Unknown

Overview

PubTabNet is a large dataset for image-based table recognition, containing more than 568k images of tabular data, each annotated with the corresponding HTML representation of the table. The table images are extracted from scientific publications in the PubMed Central Open Access Subset (commercial use collection). Table regions are identified by matching the PDF and XML versions of each article in that subset. More details are available in the authors' paper, "Image-based table recognition: data, model, and evaluation".
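To make "annotated with the corresponding HTML representation" concrete, here is a minimal Python sketch that rebuilds an HTML string from one annotation record. The record layout is an assumption, not something stated on this page: one JSON object per line, with `html.structure.tokens` holding the table tags and `html.cells` holding each cell's content tokens in reading order. The function name `annotation_to_html` and the sample record are hypothetical and for illustration only.

```python
import json


def annotation_to_html(annotation):
    """Merge structure tokens and cell tokens into one HTML string.

    Assumes the layout described above: structure tokens are table tags,
    and the cells appear in the same order as the cell tags.
    """
    structure = annotation["html"]["structure"]["tokens"]
    cells = iter(annotation["html"]["cells"])
    parts = []
    for token in structure:
        parts.append(token)
        # A cell opens either with a plain "<td>" token or, when the tag
        # carries attributes such as colspan, with a closing ">" token.
        if token in ("<td>", ">"):
            parts.append("".join(next(cells)["tokens"]))
    return "<table>" + "".join(parts) + "</table>"


# Hypothetical record, written as one JSON line for illustration.
sample_line = json.dumps({
    "filename": "example.png",
    "split": "train",
    "html": {
        "structure": {"tokens": [
            "<thead>", "<tr>",
            "<td>", "</td>",
            "<td", ' colspan="2"', ">", "</td>",
            "</tr>", "</thead>",
        ]},
        "cells": [
            {"tokens": ["A"]},
            {"tokens": ["<b>", "B", "</b>"]},
        ],
    },
})

sample_html = annotation_to_html(json.loads(sample_line))
print(sample_html)
```

Keeping structure and cell content in separate token lists, if the released files do follow this layout, lets a model predict the table skeleton and the cell text as separate targets.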

Variants: PubTabNet

Associated Benchmarks

This dataset is used in 1 benchmark, for the Table Recognition task.

Recent Benchmark Submissions

Task | Model | Paper | Date
Table Recognition | MuTabNet | Multi-Cell Decoder and Mutual Learning … | 2024-04-20
Table Recognition | ConvStem | High-Performance Transformers for Table Structure … | 2023-11-09
Table Recognition | Multi-Task Learning Model | An End-to-End Multi-Task Learning Model … | 2023-03-15
Table Recognition | SLANet | PP-StructureV2: A Stronger Document Analysis … | 2022-10-11
Table Recognition | TRUST | TRUST: An Accurate and End-to-End … | 2022-08-31
Table Recognition | TSRFormer | TSRFormer: Table Structure Recognition with … | 2022-08-09
Table Recognition | RTSR | Robust Table Detection and Structure … | 2022-03-17
Table Recognition | NCGM | Neural Collaborative Graph Machines for … | 2021-11-26
Table Recognition | SEM | Split, embed and merge: An … | 2021-07-12
Table Recognition | LGPMA | LGPMA: Complicated Table Structure Recognition … | 2021-05-13
Table Recognition | TableMaster | PingAn-VCGroup's Solution for ICDAR 2021 … | 2021-05-05
Table Recognition | TabStruct-Net | Table Structure Recognition using Top-Down … | 2020-10-09
Table Recognition | EDD | Image-based table recognition: data, model, … | 2019-11-25

Research Papers

Recent papers with results on this dataset: