EC-FUNSD

Name: EC-FUNSD
Published: 2024-02-04
License: CC-BY-4.0

Dataset Information

Modalities

Images, Texts

Languages

English

Introduced

2024

Overview

EC-FUNSD is introduced in [arXiv:2402.02379] as a benchmark for semantic entity recognition (SER) and entity linking (EL), designed for the entity-centric robustness evaluation of pre-trained text-and-layout models (PTLMs).

In practical applications of document intelligence, PTLMs (e.g. the LayoutLM series) generally serve as the encoder of document layouts, similar to the role played by pre-trained contextualized language models (e.g. the BERT series) in NLP tasks.
Conventionally, the information extraction (IE) ability of PTLMs is evaluated via SER and EL, especially in a sequence-labeling manner.
The performance of PTLMs on these tasks reflects the capacity of their layout embeddings to facilitate downstream IE tasks.
However, the prevailing benchmarks do not fully conform to this evaluation pipeline, which diminishes the reliability of the assessment. Take FUNSD as an example: its block-level annotation improperly couples segment annotations with entity annotations, so it does not adequately represent semantically driven entities and hinders fair evaluation.
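
To make the sequence-labeling formulation of SER concrete, here is a minimal sketch that converts entity word spans into BIO tags, one tag per word. The function name `spans_to_bio`, the span format `(start, end_exclusive, label)`, and the label names `QUESTION` and `ANSWER` are illustrative assumptions, not the dataset's actual schema.

```python
from typing import List, Tuple


def spans_to_bio(num_words: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) word spans into BIO tags."""
    tags = ["O"] * num_words
    for start, end, label in spans:
        if not (0 <= start < end <= num_words):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        # First word of the entity gets B-, the rest get I-.
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical example: labels and words are made up for illustration.
words = ["Date", "of", "Birth", ":", "1990-01-01"]
spans = [(0, 4, "QUESTION"), (4, 5, "ANSWER")]
tags = spans_to_bio(len(words), spans)
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

This encoding only works when every entity occupies a contiguous run of words in the input order, which is why the coupling between segments and entities matters for this style of evaluation.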

EC-FUNSD aims to provide a fair and unbiased benchmark for evaluating the IE ability of PTLMs.
The construction of this dataset includes the revision of layout and IE annotations from FUNSD.
First, the original layout annotation of FUNSD is cleaned, and multiple-row blocks are split into row-wise segments.
Second, the semantic entities are re-annotated together with their linking relationships, with the segment order preserved so that each entity is represented as a continuous word span within the layout, which makes the dataset suitable for sequence-labeling models.
The final dataset consists of 199 document samples, each including the image, layout annotations of segments and words, and labeled entities of 3 categories.
For the detailed annotation process and statistics, please refer to the original paper.
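
The first construction step, splitting multiple-row blocks into row-wise segments, can be approximated with a simple geometric heuristic. The sketch below is an assumption for illustration and not the authors' procedure: it groups the words of one block by vertical center and emits one segment per row. The word format (`text`, and `box` as `[x0, y0, x1, y1]`), the function name `split_block_into_rows`, and the tolerance `tol` are all hypothetical.

```python
from typing import Dict, List


def split_block_into_rows(words: List[Dict], tol: float = 0.5) -> List[Dict]:
    """Group the words of one block into row-wise segments.

    Each word is a dict {"text": str, "box": [x0, y0, x1, y1]} (assumed format).
    A word joins the current row when its vertical center lies within
    tol * word_height of the center of that row's first word.
    """
    rows: List[Dict] = []
    # Visit words from top to bottom by vertical center.
    for word in sorted(words, key=lambda w: (w["box"][1] + w["box"][3]) / 2):
        _, y0, _, y1 = word["box"]
        center, height = (y0 + y1) / 2, y1 - y0
        if rows and abs(center - rows[-1]["center"]) <= tol * height:
            rows[-1]["words"].append(word)
        else:
            rows.append({"center": center, "words": [word]})

    segments = []
    for row in rows:
        # Restore left-to-right order inside each row.
        row_words = sorted(row["words"], key=lambda w: w["box"][0])
        boxes = [w["box"] for w in row_words]
        segments.append({
            "text": " ".join(w["text"] for w in row_words),
            "box": [min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes)],
            "words": row_words,
        })
    return segments


# Hypothetical two-row block; coordinates are made up.
block = [
    {"text": "Name", "box": [10, 10, 50, 20]},
    {"text": "of", "box": [55, 10, 70, 20]},
    {"text": "applicant", "box": [10, 24, 80, 34]},
]
segments = split_block_into_rows(block)
for seg in segments:
    print(seg["text"], seg["box"])
```

A heuristic like this would fail on skewed scans or on rows with very different font sizes, so it should be read as a sketch of the idea rather than a substitute for the cleaned annotations.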

Variants: EC-FUNSD

Associated Benchmarks

This dataset is used in 2 benchmarks:

Entity Linking - Metrics: F1
Semantic entity labeling - Metrics: F1
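
Both benchmarks report F1. As a rough sketch, assuming exact-match micro-averaged scoring (the precise matching protocol is not stated on this page), F1 can be computed over sets of predicted and gold items: `(start, end, label)` spans for entity labeling and `(head, tail)` pairs for entity linking. The function name `f1_score` and all example values are hypothetical.

```python
from typing import Hashable, Set


def f1_score(gold: Set[Hashable], pred: Set[Hashable]) -> float:
    """Micro F1 with exact matching between gold and predicted items."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Entity labeling: one of two predicted spans matches exactly.
gold_entities = {(0, 4, "QUESTION"), (4, 5, "ANSWER")}
pred_entities = {(0, 4, "QUESTION"), (4, 6, "ANSWER")}
ser_f1 = f1_score(gold_entities, pred_entities)

# Entity linking: links are (head_entity_id, tail_entity_id) pairs.
gold_links = {(0, 1)}
pred_links = {(0, 1), (0, 2)}
el_f1 = f1_score(gold_links, pred_links)

print(f"SER F1 = {ser_f1:.3f}, EL F1 = {el_f1:.3f}")
```

Under exact matching, a span with a wrong boundary counts as both a false positive and a false negative, which is why boundary errors are penalized heavily in entity-level F1.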

Recent Benchmark Submissions

Task	Model	Paper	Date
Entity Linking	RORE (GeoLayoutLM)	Modeling Layout Reading Order as …	2024-09-29
Entity Linking	RORE (LayoutLMv3-large)	Modeling Layout Reading Order as …	2024-09-29
Entity Linking	RORE (LayoutLMv3-base)	Modeling Layout Reading Order as …	2024-09-29
Semantic entity labeling	RORE (LayoutLMv3-large)	Modeling Layout Reading Order as …	2024-09-29
Semantic entity labeling	RORE (GeoLayoutLM)	Modeling Layout Reading Order as …	2024-09-29
Semantic entity labeling	RORE (LayoutLMv3-base)	Modeling Layout Reading Order as …	2024-09-29
Semantic entity labeling	GeoLayoutLM	Rethinking the Evaluation of Pre-trained …	2024-02-04
Semantic entity labeling	LayoutLMv3 (large)	Rethinking the Evaluation of Pre-trained …	2024-02-04
Entity Linking	GeoLayoutLM	Rethinking the Evaluation of Pre-trained …	2024-02-04
Entity Linking	LayoutLMv3 (large)	Rethinking the Evaluation of Pre-trained …	2024-02-04
Entity Linking	LayoutLMv3 (base)	Rethinking the Evaluation of Pre-trained …	2024-02-04
Semantic entity labeling	LayoutLMv3 (base)	Rethinking the Evaluation of Pre-trained …	2024-02-04
Entity Linking	LayoutLMv3 (base)	LayoutLMv3: Pre-training for Document AI …	2022-04-18
Semantic entity labeling	LayoutLMv3 (base)	LayoutLMv3: Pre-training for Document AI …	2022-04-18
Entity Linking	LayoutLMv3 (large)	LayoutLMv3: Pre-training for Document AI …	2022-04-18
Semantic entity labeling	LayoutLMv3 (large)	LayoutLMv3: Pre-training for Document AI …	2022-04-18
