Discovery

Dataset Information
Modalities: Texts
Languages: English
Introduced: 2019
License: Apache 2.0

Overview

The Discovery dataset consists of adjacent sentence pairs (s1, s2) with a discourse marker (y) that occurred at the beginning of s2. The pairs were extracted from the depcc web corpus.
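To make the pair format concrete, here is a minimal Python sketch of how one (s1, s2, y) example might be derived from two adjacent sentences. The `MARKERS` subset, the `Pair` type, the `extract_pair` helper, its matching rule, and the choice to strip the marker from s2 are all illustrative assumptions, not the authors' extraction pipeline.

```python
from typing import NamedTuple, Optional

# Illustrative subset only (assumption); the real dataset uses 174 markers.
MARKERS = ("therefore", "personally", "sadly", "similarly", "curiously")


class Pair(NamedTuple):
    s1: str  # first sentence
    s2: str  # second sentence, with the leading marker stripped (assumption)
    y: str   # discourse marker that opened s2


def extract_pair(first: str, second: str) -> Optional[Pair]:
    """Return a Pair if `second` opens with a known marker, else None."""
    lowered = second.lower()
    for marker in MARKERS:
        if lowered.startswith(marker):
            rest = second[len(marker):]
            # Require a space or comma after the marker so that a longer
            # word sharing the same prefix is not matched by mistake.
            if rest and rest[0] not in " ,":
                continue
            remainder = rest.lstrip(" ,")
            if remainder:
                return Pair(first, remainder, marker)
    return None
```

For example, `extract_pair("It rained all day.", "Therefore, the match was cancelled.")` yields a pair whose `y` is `"therefore"`, while a second sentence that does not open with a listed marker yields `None`.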

Marker prediction can be used to train sentence encoders. Discourse markers can be considered noisy labels for various semantic tasks, such as entailment (y=therefore), subjectivity analysis (y=personally), sentiment analysis (y=sadly), similarity (y=similarly), or typicality (y=curiously).
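Framed as supervised learning, marker prediction is a classification problem over sentence pairs: the input is (s1, s2) and the target is the class id of y. The sketch below only shows that framing. The names `PROXY_TASK`, `build_label_map`, and `to_training_example` are hypothetical, and no particular encoder or training library is implied.

```python
# Marker -> semantic task it loosely signals, as listed in the text above.
PROXY_TASK = {
    "therefore": "entailment",
    "personally": "subjectivity analysis",
    "sadly": "sentiment analysis",
    "similarly": "similarity",
    "curiously": "typicality",
}


def build_label_map(markers):
    """Assign a stable integer class id to each distinct marker."""
    return {marker: i for i, marker in enumerate(sorted(set(markers)))}


def to_training_example(s1, s2, y, label_map):
    """Turn one (s1, s2, y) triple into (inputs, class id) for a pair classifier."""
    return (s1, s2), label_map[y]
```

A sentence encoder trained this way would embed s1 and s2, combine the two embeddings, and predict the class id. The labels are noisy because the same marker can signal different relations in different contexts.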

What sets this dataset apart is the diversity of its markers: previously used data covered only ~10 imbalanced classes. The dataset authors provide:

  • a list of the 174 discourse markers
  • a Base version of the dataset with 1.74 million pairs (10k examples per marker)
  • a Big version with 3.4 million pairs
  • a Hard version with 1.74 million pairs where the connective couldn't be predicted with a fastText linear model
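The Base and Hard versions correspond to two simple operations on a pool of (s1, s2, y) triples: per-marker balancing and filtering by a baseline's errors. The following Python sketch illustrates both under stated assumptions. `balanced_sample` and `hard_subset` are hypothetical helpers, and `predict` is a stand-in callable for the fastText linear model (no fastText API is used here). How the authors actually sampled or filtered is not described on this page.

```python
import random
from collections import defaultdict


def balanced_sample(pairs, per_marker, seed=0):
    """Keep at most `per_marker` randomly chosen examples for each marker y."""
    by_marker = defaultdict(list)
    for pair in pairs:
        by_marker[pair[2]].append(pair)
    rng = random.Random(seed)
    sample = []
    for marker in sorted(by_marker):
        group = by_marker[marker]
        rng.shuffle(group)
        sample.extend(group[:per_marker])
    return sample


def hard_subset(pairs, predict):
    """Keep pairs whose marker the baseline `predict(s1, s2)` gets wrong."""
    return [p for p in pairs if predict(p[0], p[1]) != p[2]]


# Size check for the Base version: 174 markers x 10,000 examples each.
assert 174 * 10_000 == 1_740_000
```

With 174 markers and 10k examples per marker, balancing gives the 1.74 million pairs quoted for the Base version.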

Source: GitHub

Variants: Discovery

Associated Benchmarks

This dataset is used in 1 benchmark.

Recent Benchmark Submissions

Task | Model | Paper | Date
Relation Classification | BERT | Mining Discourse Markers for Unsupervised … | 2019-03-28
