ConvAI2

Conversational Intelligence Challenge 2

Dataset Information
Modalities
Texts, Dialog
Languages
English
Introduced
2019

Overview

The ConvAI2 NeurIPS competition aimed to find approaches for creating high-quality dialogue agents capable of meaningful open-domain conversation. The ConvAI2 training dataset is based on the PERSONA-CHAT dataset. Each speaker in a pair is assigned a profile drawn from a set of 1,155 possible personas (at training time), each consisting of at least 5 profile sentences; 100 never-before-seen personas are set aside for validation. Because the original PERSONA-CHAT test set had already been released, crowdsourced workers created a new hidden test set consisting of 100 new personas and over 1,015 dialogs.

To discourage models from exploiting trivial word overlap, additional rewritten sets of the same train and test personas were crowdsourced. The rewritten sentences are rephrases, generalizations or specializations of the originals, which makes the task much more challenging. For example, “I just got my nails done” is revised as “I love to pamper myself on a regular basis” and “I am on a diet now” is revised as “I need to lose weight.”
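The effect of the revision can be illustrated with a small sketch that measures word overlap between the original and revised persona sentences quoted above. The tokenizer and the choice of Jaccard similarity are assumptions made for illustration only; the function names `tokens` and `jaccard_overlap` are hypothetical and not part of the ConvAI2 tooling.

```python
import re


def tokens(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard_overlap(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Original / revised persona sentence pairs taken from the text above.
pairs = [
    ("I just got my nails done", "I love to pamper myself on a regular basis"),
    ("I am on a diet now", "I need to lose weight"),
]

for original, revised in pairs:
    score = jaccard_overlap(original, revised)
    print(f"{score:.2f}  {original!r} -> {revised!r}")
```

In both pairs the only shared word is "I", so a model that matches persona sentences to utterances by surface word overlap gets almost no signal from the revised personas.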

The training, validation and hidden test sets consist of 17,878, 1,000 and 1,015 dialogues, respectively.
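As a quick check on the relative sizes, the following sketch totals the three splits and prints each split's share. The dictionary keys are illustrative labels, not official split names.

```python
# Dialogue counts per split, as stated above.
splits = {"train": 17878, "validation": 1000, "hidden test": 1015}

total = sum(splits.values())
print(f"total dialogues: {total}")

for name, count in splits.items():
    share = 100 * count / total
    print(f"{name:12s} {count:6d}  {share:5.1f}%")
```

The training split accounts for roughly 90% of all dialogues, with the validation and hidden test sets at about 5% each.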

Source: The Second Conversational Intelligence Challenge (ConvAI2)

Variants: ConvAI2

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Visual Dialog | Multi-Modal BlenderBot | Multi-Modal Open-Domain Dialogue | 2020-10-02

Research Papers

Recent papers with results on this dataset: