Discovery

Name: Discovery
Published: 2019-03-28
License: Apache 2.0

Dataset Information

Modalities: Texts
Languages: English
Introduced: 2019
License: Apache 2.0
Homepage: Official Website

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

The Discovery dataset consists of adjacent sentence pairs (s1, s2) with a discourse marker (y) that occurred at the beginning of s2. The pairs were extracted from the depcc web corpus.
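As a rough illustration of the (s1, s2, y) structure described above, the following sketch pulls such triples out of a list of consecutive sentences. It is not the authors' extraction pipeline: the function names `split_marker` and `extract_pairs`, the five-marker subset, the requirement that the marker be followed by a comma, and the choice to strip the marker from s2 are all assumptions made for the example.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Illustrative subset only; the dataset's real list contains 174 markers.
MARKERS = ["therefore", "personally", "sadly", "similarly", "curiously"]


def split_marker(sentence: str,
                 markers: Iterable[str] = MARKERS) -> Optional[Tuple[str, str]]:
    """Return (marker, rest) if the sentence opens with a marker and a comma.

    The comma requirement is an assumption of this sketch, not a documented
    rule of the dataset.
    """
    for marker in markers:
        pattern = rf"{re.escape(marker)}\s*,\s*(.+)"
        match = re.match(pattern, sentence, flags=re.IGNORECASE)
        if match:
            return marker, match.group(1)
    return None


def extract_pairs(sentences: List[str]) -> List[Tuple[str, str, str]]:
    """Build (s1, s2, y) triples from adjacent sentences.

    s2 is stored without its leading marker (assumed here), so that y can
    serve as a label to predict from the pair.
    """
    pairs = []
    for s1, s2 in zip(sentences, sentences[1:]):
        found = split_marker(s2)
        if found is not None:
            marker, rest = found
            pairs.append((s1, rest, marker))
    return pairs
```

For example, calling `extract_pairs` on the three sentences "It rained all day.", "Sadly, the picnic was cancelled.", "We stayed home." would yield a single triple whose label is `sadly`.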

Marker prediction can be used to train sentence encoders. Discourse markers can be considered noisy labels for various semantic tasks, such as entailment (y=therefore), subjectivity analysis (y=personally), sentiment analysis (y=sadly), similarity (y=similarly), or typicality (y=curiously).

What distinguishes this dataset is the diversity of its markers: previously used datasets covered only about 10 imbalanced classes. The dataset authors provide:

a list of the 174 discourse markers
a Base version of the dataset with 1.74 million pairs (10k examples per marker)
a Big version with 3.4 million pairs
a Hard version with 1.74 million pairs where the connective couldn't be predicted with a fastText linear model
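The Base and Hard variants listed above can be pictured with a small sketch. The Base size follows from the numbers given: 174 markers × 10,000 examples = 1.74 million pairs. The code below is only an approximation under stated assumptions: `balanced_sample` and `hard_subset` are hypothetical helper names, the baseline is passed in as an arbitrary callable standing in for the fastText linear model, and "couldn't be predicted" is interpreted here as "the baseline's prediction differs from the true marker". The authors' exact procedure may differ.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str, str]  # (s1, s2, marker)


def balanced_sample(examples: List[Example],
                    per_marker: int,
                    seed: int = 0) -> List[Example]:
    """Keep at most `per_marker` examples for each marker (Base-style balance)."""
    rng = random.Random(seed)
    by_marker: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_marker[example[2]].append(example)

    sample: List[Example] = []
    for marker in sorted(by_marker):
        group = by_marker[marker]
        rng.shuffle(group)
        sample.extend(group[:per_marker])
    return sample


def hard_subset(examples: List[Example],
                baseline_predict: Callable[[str, str], str]) -> List[Example]:
    """Keep examples whose marker the baseline gets wrong (assumed criterion)."""
    return [
        (s1, s2, marker)
        for s1, s2, marker in examples
        if baseline_predict(s1, s2) != marker
    ]
```

With `per_marker=10_000` and the full list of 174 markers, `balanced_sample` would return at most 1,740,000 examples, which matches the Base size quoted above.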

Source: GitHub

Variants: Discovery

Associated Benchmarks

This dataset is used in 1 benchmark:

Relation Classification - Metrics: 1:1 Accuracy

Recent Benchmark Submissions

Task	Model	Paper	Date
Relation Classification	BERT	Mining Discourse Markers for Unsupervised …	2019-03-28

Research Papers

Recent papers with results on this dataset:

Mining Discourse Markers for Unsupervised Sentence Representation Learning (2019)

External Links:

Discovery
