EC-FUNSD

Dataset Information
Modalities: Images, Texts
Languages: English
Introduced: 2024

Overview

EC-FUNSD was introduced in [arXiv:2402.02379] as a benchmark for semantic entity recognition (SER) and entity linking (EL), designed for entity-centric robustness evaluation of pre-trained text-and-layout models (PTLMs).

In practical applications of document intelligence, PTLMs (e.g. the LayoutLM series) generally serve as the encoder of document layouts, similar to the role played by pre-trained contextualized language models (e.g. the BERT series) in NLP tasks.
Conventionally, the information extraction (IE) ability of PTLMs is evaluated via SER and EL, especially in a sequence-labeling formulation.
The performance of PTLMs on these tasks reflects the capacity of their layout embeddings to facilitate downstream IE tasks.
However, the prevailing benchmarks do not fully conform to this evaluation pipeline, which makes the assessment less reliable. In FUNSD, for example, the block-level annotation couples the segment and entity annotations, so the labels do not adequately represent semantically driven entities and hinder fair evaluation.
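
To make the sequence-labeling formulation of SER concrete, the following minimal sketch converts entity word spans into BIO tags, one tag per word. The `(start, end, label)` span format, the helper name `spans_to_bio`, and the label names are assumptions made for illustration; they are not taken from the dataset files or the paper.

```python
from typing import List, Tuple

# Hypothetical entity format: (start_word, end_word_exclusive, label).
Entity = Tuple[int, int, str]


def spans_to_bio(num_words: int, entities: List[Entity]) -> List[str]:
    """Turn contiguous, non-overlapping word spans into BIO tags."""
    tags = ["O"] * num_words
    for start, end, label in entities:
        if not (0 <= start < end <= num_words):
            raise ValueError(f"span ({start}, {end}) is out of range")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span ({start}, {end}) overlaps another entity")
        tags[start] = f"B-{label}"          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining words of the entity
    return tags


# Illustrative words and labels (assumed, not real dataset content).
words = ["Date", ":", "March", "3", ",", "1998"]
entities = [(0, 2, "QUESTION"), (2, 6, "ANSWER")]
tags = spans_to_bio(len(words), entities)
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

This tagging only works when every entity is a contiguous run of words in the input order, which is why an annotation that ties entities to layout blocks rather than to semantic spans can distort the evaluation.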

EC-FUNSD aims to provide a fair and unbiased benchmark for evaluating the IE ability of PTLMs.
The construction of this dataset includes the revision of layout and IE annotations from FUNSD.
First, the original layout annotation of FUNSD is cleaned, and multiple-row blocks are split into row-wise segments.
Second, the semantic entities are re-annotated together with their linking relationships, with the segment order preserved so that each entity is represented as a continuous word span within the layout, which makes the dataset suitable for sequence-labeling models.
The final dataset consists of 199 document samples, each including the image, the layout annotation of segments and words, and labeled entities of 3 categories.
For the detailed annotation process and statistics, please refer to the original paper.
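
As a rough illustration of what splitting a multiple-row block into row-wise segments can look like, the sketch below groups the words of one block by the vertical center of their bounding boxes. This is only an assumed heuristic, not the annotation procedure used to build EC-FUNSD; the word format (`text`, `box` as `[x0, y0, x1, y1]`), the function name `split_block_into_rows`, and the tolerance `tol` are all hypothetical.

```python
from typing import Any, Dict, List

Word = Dict[str, Any]  # assumed format: {"text": str, "box": [x0, y0, x1, y1]}


def split_block_into_rows(words: List[Word], tol: float = 0.5) -> List[Dict[str, Any]]:
    """Group the words of one block into row-wise segments.

    A word joins the current row when its vertical center lies within
    tol * word height of the row's running mean center (a heuristic).
    """
    def center_y(w: Word) -> float:
        return (w["box"][1] + w["box"][3]) / 2

    rows: List[Dict[str, Any]] = []
    for word in sorted(words, key=lambda w: (center_y(w), w["box"][0])):
        cy = center_y(word)
        height = max(word["box"][3] - word["box"][1], 1)
        if rows and abs(cy - rows[-1]["cy"]) <= tol * height:
            row = rows[-1]
            row["words"].append(word)
            row["cy"] += (cy - row["cy"]) / len(row["words"])  # running mean
        else:
            rows.append({"cy": cy, "words": [word]})

    segments = []
    for row in rows:
        ordered = sorted(row["words"], key=lambda w: w["box"][0])  # left to right
        boxes = [w["box"] for w in ordered]
        segments.append({
            "text": " ".join(w["text"] for w in ordered),
            "box": [min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes)],
            "words": ordered,
        })
    return segments


# Illustrative two-row block (assumed coordinates).
block = [
    {"text": "Name", "box": [10, 10, 50, 22]},
    {"text": "of", "box": [55, 10, 70, 22]},
    {"text": "applicant", "box": [10, 26, 80, 38]},
]
segments = split_block_into_rows(block)
for seg in segments:
    print(seg["text"], seg["box"])
```

Each resulting segment covers a single text row, so concatenating segments in reading order yields a word sequence in which an entity can be marked as one continuous span.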

Variants: EC-FUNSD

Associated Benchmarks

This dataset is used in 2 benchmarks: Entity Linking and Semantic entity labeling.

Recent Benchmark Submissions

Task | Model | Paper | Date
Entity Linking | RORE (GeoLayoutLM) | Modeling Layout Reading Order as … | 2024-09-29
Entity Linking | RORE (LayoutLMv3-large) | Modeling Layout Reading Order as … | 2024-09-29
Entity Linking | RORE (LayoutLMv3-base) | Modeling Layout Reading Order as … | 2024-09-29
Semantic entity labeling | RORE (LayoutLMv3-large) | Modeling Layout Reading Order as … | 2024-09-29
Semantic entity labeling | RORE (GeoLayoutLM) | Modeling Layout Reading Order as … | 2024-09-29
Semantic entity labeling | RORE (LayoutLMv3-base) | Modeling Layout Reading Order as … | 2024-09-29
Semantic entity labeling | GeoLayoutLM | Rethinking the Evaluation of Pre-trained … | 2024-02-04
Semantic entity labeling | LayoutLMv3 (large) | Rethinking the Evaluation of Pre-trained … | 2024-02-04
Entity Linking | GeoLayoutLM | Rethinking the Evaluation of Pre-trained … | 2024-02-04
Entity Linking | LayoutLMv3 (large) | Rethinking the Evaluation of Pre-trained … | 2024-02-04
Entity Linking | LayoutLMv3 (base) | Rethinking the Evaluation of Pre-trained … | 2024-02-04
Semantic entity labeling | LayoutLMv3 (base) | Rethinking the Evaluation of Pre-trained … | 2024-02-04
Entity Linking | LayoutLMv3 (base) | LayoutLMv3: Pre-training for Document AI … | 2022-04-18
Semantic entity labeling | LayoutLMv3 (base) | LayoutLMv3: Pre-training for Document AI … | 2022-04-18
Entity Linking | LayoutLMv3 (large) | LayoutLMv3: Pre-training for Document AI … | 2022-04-18
Semantic entity labeling | LayoutLMv3 (large) | LayoutLMv3: Pre-training for Document AI … | 2022-04-18
