CATT

CATT Arabic Diacritization Benchmark Dataset

Dataset Information
Modalities: Texts
Languages: Arabic
Introduced: 2024

Overview

The CATT benchmark dataset comprises 742 sentences scraped from an online news source in 2023.
It covers a range of topics, including science and technology, economics, politics, sports, arts, and culture.
Two expert native Arabic speakers diacritized it manually, and a third expert validated the result.
The dataset contains names of people and places in both Arabic and English.
English names are written in Arabic letters and diacritized according to their pronunciation.
Numbers in the sentences are written out in words rather than digits, so models can be evaluated without a text normalizer (TN).
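
As a rough illustration of how a manually diacritized benchmark like this might be used, the sketch below strips the diacritics from a gold sentence to form a model input, then scores a prediction with a simple per-character diacritic error rate. The metric definition, the set of diacritic code points, and the function names are assumptions made for this example. They are not taken from the dataset or its paper, and the two strings are toy inputs rather than benchmark sentences.

```python
# Assumed diacritic set: Arabic tashkeel U+064B..U+0652 plus superscript alef U+0670.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text):
    """Remove diacritics, leaving the bare text a model would receive as input."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def split_diacritics(text):
    """Return (base_char, marks) pairs, attaching each mark to the preceding base."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # a leading mark with no base character is ignored
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(gold, predicted):
    """Fraction of non-space characters whose marks differ (hypothetical metric)."""
    gold_pairs = split_diacritics(gold)
    pred_pairs = split_diacritics(predicted)
    if [b for b, _ in gold_pairs] != [b for b, _ in pred_pairs]:
        raise ValueError("gold and predicted text differ in their base characters")
    scored = [
        (g, p)
        for (base, g), (_, p) in zip(gold_pairs, pred_pairs)
        if not base.isspace()
    ]
    if not scored:
        return 0.0
    # Compare sorted marks so that e.g. shadda+fatha matches fatha+shadda.
    errors = sum(1 for g, p in scored if sorted(g) != sorted(p))
    return errors / len(scored)


# Toy example: three letters, each with fatha in the gold text;
# the prediction puts damma on the last letter instead.
gold = "\u0643\u064E\u062A\u064E\u0628\u064E"
pred = "\u0643\u064E\u062A\u064E\u0628\u064F"

model_input = strip_diacritics(gold)
print(model_input)
print(diacritic_error_rate(gold, pred))
```

Because the numbers in the sentences are already written out in words, a comparison like this can run directly on the text without a separate normalization pass.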

Variants: CATT

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Arabic Text Diacritization | Multilevel Diacritizer | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | CATT ED | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | GPT-4 | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | CBHG | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | Command R+ | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | CATT EO | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | Shakkala | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | Sakhr | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | Alkhalil | CATT: Character-based Arabic Tashkeel Transformer | 2024-07-03
Arabic Text Diacritization | Deep Diacritization (D3) | Deep Diacritization: Efficient Hierarchical Recurrence … | 2020-11-01
Arabic Text Diacritization | Deep Diacritization (D2) | Deep Diacritization: Efficient Hierarchical Recurrence … | 2020-11-01

Research Papers

Recent papers with results on this dataset: