HierText

Name: HierText
Published: 2022-06-03
License: Unknown

Dataset Information

Introduced: 2022
License: Unknown
Homepage: Official Website

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

HierText is the first dataset featuring hierarchical annotations of text in natural scenes and documents. The dataset contains 11,639 images selected from the Open Images dataset and provides high-quality annotations at the word (~1.2M words), line, and paragraph levels. Text lines are defined as connected sequences of words that are aligned in spatial proximity and are logically connected. Text lines that belong to the same semantic topic and are geometrically coherent form paragraphs. Images in HierText are rich in text, with an average of more than 100 words per image.
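
The three-level hierarchy described above (words grouped into lines, lines grouped into paragraphs) can be pictured as a nested data structure. The following is a minimal sketch for illustration only: the class names, field names, helper function, and sample values are assumptions made here, not the dataset's official annotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinate


@dataclass
class Word:
    text: str
    vertices: List[Point]  # polygon outlining the word (hypothetical field)


@dataclass
class Line:
    # A line is a connected sequence of spatially aligned, logically related words.
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Paragraph:
    # A paragraph groups lines that share a topic and are geometrically coherent.
    lines: List[Line] = field(default_factory=list)


@dataclass
class ImageAnnotation:
    image_id: str  # hypothetical identifier, not a real dataset ID
    paragraphs: List[Paragraph] = field(default_factory=list)

    def word_count(self) -> int:
        return sum(len(line.words) for p in self.paragraphs for line in p.lines)


def box(x: int, y: int, w: int, h: int) -> List[Point]:
    """Return the four corners of an axis-aligned box, clockwise from top-left."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


# Made-up sample: one paragraph containing two lines.
example = ImageAnnotation(
    image_id="hypothetical_0001",
    paragraphs=[
        Paragraph(
            lines=[
                Line(words=[Word("OPEN", box(10, 10, 40, 12)),
                            Word("DAILY", box(55, 10, 45, 12))]),
                Line(words=[Word("9", box(10, 26, 8, 12)),
                            Word("to", box(22, 26, 14, 12)),
                            Word("5", box(40, 26, 8, 12))]),
            ]
        )
    ],
)
```

With a structure like this, per-image statistics such as the word count quoted above reduce to a traversal of the hierarchy, as `word_count` does.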

Variants: HierText

Associated Benchmarks

This dataset is used in 1 benchmark:

Hierarchical Text Segmentation - Metrics: F-score (average), F-score (stroke), F-score (word), F-score (text-line), F-score (para., layout)

Recent Benchmark Submissions

Task	Model	Paper	Date
Hierarchical Text Segmentation	Hi-SAM	Hi-SAM: Marrying Segment Anything Model …	2024-01-31

Research Papers

Recent papers with results on this dataset:

Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation (2024)

External Links:

HierText
