The Discovery dataset consists of adjacent sentence pairs (s1, s2) with a discourse marker (y) that occurred at the beginning of s2. The pairs were extracted from the depcc web corpus.
Marker prediction can be used to train sentence encoders. Discourse markers can be treated as noisy labels for various semantic tasks, such as entailment (y=therefore), subjectivity analysis (y=personally), sentiment analysis (y=sadly), similarity (y=similarly), or typicality (y=curiously).
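To make the (s1, s2, y) structure concrete, here is a minimal sketch of how such examples could be mined from a list of consecutive sentences. It is not the authors' extraction pipeline: the `MARKERS` tuple, the `Example` type, and the functions `split_marker` and `extract_pairs` are hypothetical names introduced for illustration. The sketch also assumes that a marker is followed by a comma and that it is stripped from s2 so it can serve as the label.

```python
from typing import List, NamedTuple, Optional, Tuple

# Hypothetical, tiny marker inventory for illustration only;
# the real dataset covers a much more diverse set of markers.
MARKERS = ("therefore", "personally", "sadly", "similarly", "curiously")


class Example(NamedTuple):
    s1: str  # first sentence of the pair
    s2: str  # second sentence, with the leading marker removed (assumption)
    y: str   # discourse marker found at the beginning of s2


def split_marker(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (marker, remainder) if the sentence starts with a known
    marker followed by a comma, otherwise None."""
    head, sep, rest = sentence.partition(",")
    marker = head.strip().lower()
    remainder = rest.strip()
    if sep and marker in MARKERS and remainder:
        return marker, remainder
    return None


def extract_pairs(sentences: List[str]) -> List[Example]:
    """Scan adjacent sentences and keep pairs whose second sentence
    begins with a known discourse marker."""
    examples = []
    for s1, s2 in zip(sentences, sentences[1:]):
        found = split_marker(s2)
        if found is not None:
            marker, remainder = found
            examples.append(Example(s1=s1, s2=remainder, y=marker))
    return examples
```

A sentence encoder could then be trained to predict `y` from `s1` and `s2`, which is the marker prediction task described above. Matching on a leading word plus comma is a deliberate simplification; a real pipeline would need more careful sentence splitting and marker detection.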
What sets this dataset apart is the diversity of its markers: previously used datasets covered only ~10 imbalanced classes. The authors of the dataset provide:
Source: GitHub
Variants: Discovery
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Relation Classification | BERT | Mining Discourse Markers for Unsupervised … | 2019-03-28 |
Recent papers with results on this dataset: