DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German (“plain DE”; in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
DEplain-web-sent consists of approx. 150 aligned documents and approx. 2k manually aligned sentence pairs. The data is publicly available (see licenses). The corpus includes texts from the following domains: fictional texts (literature and fairy tales), Bible texts, health-related texts, and texts for language learners. The corpus can be used for German text simplification, and more specifically for sentence simplification. The corpus is also available on Hugging Face: https://huggingface.co/datasets/DEplain/DEplain-web-sent.
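As a rough illustration of how the sentence pairs might be used, the sketch below loads the corpus from the Hugging Face repository named above and computes a simple length-based statistic. It makes several assumptions: the `datasets` library is installed, the split is called `"train"`, and no extra configuration name is required. Check the dataset card before relying on any of these. `load_pairs` and `compression_ratio` are hypothetical helper names, and the toy pair is invented for demonstration, not taken from the corpus.

```python
from typing import Iterable, Tuple

# Repository id taken from the Hugging Face URL above.
DATASET_ID = "DEplain/DEplain-web-sent"


def load_pairs(split: str = "train"):
    """Load one split from the Hugging Face Hub.

    Assumption: the split name and the absence of a config name are guesses;
    consult the dataset card for the actual layout and column names.
    """
    # Imported lazily so the helper below also works without `datasets`.
    from datasets import load_dataset
    return load_dataset(DATASET_ID, split=split)


def compression_ratio(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean ratio of simplified length to original length, in whitespace tokens."""
    ratios = [
        len(simple.split()) / len(original.split())
        for original, simple in pairs
        if original.split()  # skip empty originals to avoid division by zero
    ]
    if not ratios:
        raise ValueError("no non-empty sentence pairs given")
    return sum(ratios) / len(ratios)


# Hypothetical toy pair (original, simplified), not from the corpus.
toy_pairs = [
    ("Der Patient sollte das Medikament täglich einnehmen .",
     "Nehmen Sie das Medikament jeden Tag ."),
]
print(compression_ratio(toy_pairs))
```

A ratio below 1 means the simplified sentences are shorter on average than their originals. To apply the statistic to the real corpus, map each loaded record to an `(original, simplified)` tuple using the column names documented on the dataset card.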
Variants: DEplain-web-sent
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Text Simplification | mBART (trained on DEplain-APA-sent & DEplain-web-sent) | DEPLAIN: A German Parallel Corpus … | 2023-05-30 |
Text Simplification | mBART (trained on DEplain-APA-sent) | DEPLAIN: A German Parallel Corpus … | 2023-05-30 |