The UrduDoc dataset is a benchmark dataset for Urdu text line detection in scanned documents. It was created as a byproduct of the UTRSet-Real dataset generation process. It comprises 478 diverse images collected from sources such as books, documents, manuscripts, and newspapers, split into 358 pages for training and 120 pages for validation, and it covers a wide range of styles, scales, and lighting conditions. It serves as a benchmark for evaluating printed Urdu text detection models, and benchmark results for state-of-the-art models are provided. Among these, the ContourNet model achieves the best h-mean.
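As a rough illustration of the metric mentioned above: in text detection benchmarks, h-mean conventionally denotes the harmonic mean of precision and recall. The sketch below assumes that convention. The function names and the counts in the usage example are hypothetical and are not taken from the UrduDoc benchmark, and the rule for matching a detection to a ground-truth line (for example, an overlap threshold) is not stated here, so it is left out.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed meaning of h-mean)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(true_positives: int, num_detections: int, num_ground_truth: int):
    """Return (precision, recall, h-mean) from raw match counts.

    true_positives: detections matched to a ground-truth text line
    num_detections: all text lines predicted by the model
    num_ground_truth: all annotated text lines
    """
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, h_mean(precision, recall)


# Hypothetical counts, for illustration only (not benchmark results).
precision, recall, score = detection_scores(
    true_positives=90, num_detections=100, num_ground_truth=120
)
print(f"precision={precision:.3f} recall={recall:.3f} h-mean={score:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot score well on h-mean by trading away recall for precision or the reverse.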
The UrduDoc dataset is the first of its kind for printed Urdu text line detection and will advance research in the field. It will be made publicly available for non-commercial, academic, and research purposes upon request and execution of a no-cost license agreement. To request the dataset, or for more information about the UrduDoc, UTRSet-Real, and UTRSet-Synth datasets, please refer to the Project Website of our paper "UTRNet: High-Resolution Urdu Text Recognition In Printed Documents".
Variants: UrduDoc
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Text Detection | ContourNet [69] | UTRNet: High-Resolution Urdu Text Recognition … | 2023-06-27 |
Text Detection | DRRG [72] | UTRNet: High-Resolution Urdu Text Recognition … | 2023-06-27 |
Text Detection | PSENet [67] | UTRNet: High-Resolution Urdu Text Recognition … | 2023-06-27 |
Text Detection | EAST [75] | UTRNet: High-Resolution Urdu Text Recognition … | 2023-06-27 |
Text Detection | EAST | UTRNet: High-Resolution Urdu Text Recognition … | 2023-06-27 |
Recent papers with results on this dataset: