DEplain-web-doc

Dataset Information
Modalities: Texts
Languages: German
Introduced: 2023

Overview

DEplain-web-doc: A German Parallel Corpus for Document Simplification on Web Texts

DEplain is a dataset of parallel, professionally written and manually aligned simplifications in plain German ("plain DE"; in German, "Einfache Sprache"). It consists of four subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

DEplain-web-doc consists of approximately 150 aligned documents. The data is publicly available (see licenses). The corpus covers the following domains: fictional texts (literature and fairy tales), Bible texts, health-related texts, texts for language learners, texts for accessibility, and public administration texts. It can be used for German text simplification, and more specifically for document-level simplification. The corpus is also available on Hugging Face: https://huggingface.co/datasets/DEplain/DEplain-web-doc.
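As a rough illustration of how the Hugging Face copy might be accessed, the sketch below derives the dataset ID from the URL above and passes it to `load_dataset` from the `datasets` library. The helper names `repo_id_from_url` and `load_deplain_web_doc` are hypothetical, and the sketch assumes the repository loads without an explicit configuration name and without prior authentication, which this page does not confirm.

```python
from urllib.parse import urlparse

# URL taken from the dataset description above.
HUB_URL = "https://huggingface.co/datasets/DEplain/DEplain-web-doc"


def repo_id_from_url(url: str) -> str:
    """Extract the '<namespace>/<name>' dataset ID from a Hugging Face dataset URL.

    Hypothetical helper, introduced only for this illustration.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:3])


def load_deplain_web_doc():
    """Download the corpus from the Hub (needs network access and `pip install datasets`).

    Assumption: the repository loads with its default configuration.
    """
    # Imported lazily so that repo_id_from_url works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(HUB_URL))
```

Printing the object returned by `load_deplain_web_doc()` lists the available splits and column names; it is worth inspecting these before relying on any particular field, since they are not documented on this page.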

Variants: DEplain-web-doc

Associated Benchmarks

This dataset is used in 1 benchmark:

  • Text Simplification

Recent Benchmark Submissions

Task | Model | Paper | Date
Text Simplification | long-mBART (trained on DEplain-APA-doc & DEplain-web-doc) | DEPLAIN: A German Parallel Corpus … | 2023-05-30
Text Simplification | long-mBART (trained on DEplain-web-doc) | DEPLAIN: A German Parallel Corpus … | 2023-05-30
Text Simplification | long-mBART (trained on DEplain-APA-doc) | DEPLAIN: A German Parallel Corpus … | 2023-05-30

Research Papers

Recent papers with results on this dataset: