HierText

Dataset Information
Introduced: 2022
License: Unknown
Homepage:

Overview

HierText is the first dataset featuring hierarchical annotations of text in natural scenes and documents. The dataset contains 11,639 images selected from the Open Images dataset, with high-quality word-level (~1.2M), line-level, and paragraph-level annotations. A text line is defined as a connected sequence of words that are spatially aligned and logically connected. Text lines that belong to the same semantic topic and are geometrically coherent form paragraphs. Images in HierText are rich in text, with an average of more than 100 words per image.
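The word, line, and paragraph hierarchy described above can be pictured as a simple nested data structure. The following is a minimal sketch only: the class names (`Word`, `Line`, `Paragraph`), the `polygon` field, and the `count_words` helper are hypothetical illustrations and are not taken from the dataset's actual annotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical types for illustration; not the dataset's real schema.
Point = Tuple[int, int]


@dataclass
class Word:
    text: str
    polygon: List[Point]  # assumed: outline of the word as (x, y) vertices


@dataclass
class Line:
    # A line is an ordered sequence of spatially aligned, connected words.
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Paragraph:
    # A paragraph groups lines that share a topic and are geometrically coherent.
    lines: List[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


def count_words(paragraphs: List[Paragraph]) -> int:
    """Count all words across the paragraphs of one image."""
    return sum(len(line.words) for p in paragraphs for line in p.lines)


# Usage: one paragraph made of two lines, with made-up coordinates.
box = [(0, 0), (10, 0), (10, 5), (0, 5)]
sign = Paragraph(lines=[
    Line(words=[Word("NO", box), Word("PARKING", box)]),
    Line(words=[Word("ANYTIME", box)]),
])
print(sign.text)
print(count_words([sign]))
```

Keeping words nested inside lines, and lines inside paragraphs, means each level's text can be derived from the level below it rather than stored separately.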

Variants: HierText

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Hierarchical Text Segmentation | Hi-SAM | Hi-SAM: Marrying Segment Anything Model … | 2024-01-31

Research Papers

Recent papers with results on this dataset: