CheGeKa

Dataset Information

  • Modalities: Texts
  • Languages: Russian
  • Introduced: 2022

Overview

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.

Motivation

The task is among the most challenging in terms of reasoning, knowledge, and logic. The QA pairs use a free-response format (no answer choices), and the correct answer is reached through a long chain of causal relationships between facts and associations.

The original corpus of the CheGeKa game was introduced in Mikhalkova (2021).

An example in English for illustration purposes:

```
{
    'question_id': 3665,
    'question': 'THIS MAN replaced John Lennon when the Beatles got together for the last time.',
    'answer': 'Julian Lennon',
    'topic': 'The Liverpool Four',
    'author': 'Bayram Kuliyev',
    'tour_name': 'Jeopardy!. Ashgabat-1996',
    'tour_link': 'https://db.chgk.info/tour/ash96sv',
    'episode': [16],
    'perturbation': 'chegeka'
}
```

Data Fields

  • question_id: an integer corresponding to the question id in the database
  • question: a string containing the question text
  • answer: a string containing the correct answer to the question
  • topic: a string containing the question category
  • author: a string with the full name of the author
  • tour_name: a string with the title of a tournament
  • tour_link: a string containing the link to a tournament (None for the test set)
  • perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
  • episode: a list of episodes in which the instance is used. Only used for the train set
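
The field list above can be mirrored in a small typed container. This is a minimal sketch, not part of the dataset's tooling: the names `CheGeKaRecord` and `record_from_dict` are hypothetical, and the defaults for `tour_link`, `perturbation`, and `episode` are assumptions based on the field descriptions (None for the test set, the dataset name when no perturbation is applied, episodes only for the train set).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CheGeKaRecord:
    """Hypothetical container mirroring the fields listed above."""
    question_id: int
    question: str
    answer: str
    topic: str
    author: str
    tour_name: str
    tour_link: Optional[str] = None  # None for the test set
    perturbation: str = "chegeka"    # dataset name when no perturbation is applied
    episode: List[int] = field(default_factory=list)  # only used for the train set


def record_from_dict(raw: dict) -> CheGeKaRecord:
    """Build a record, tolerating absent optional fields (an assumption)."""
    return CheGeKaRecord(
        question_id=int(raw["question_id"]),
        question=raw["question"],
        answer=raw["answer"],
        topic=raw["topic"],
        author=raw["author"],
        tour_name=raw["tour_name"],
        tour_link=raw.get("tour_link"),
        perturbation=raw.get("perturbation", "chegeka"),
        episode=list(raw.get("episode") or []),
    )


# The illustrative English example from this card.
example = record_from_dict({
    "question_id": 3665,
    "question": "THIS MAN replaced John Lennon when the Beatles got together for the last time.",
    "answer": "Julian Lennon",
    "topic": "The Liverpool Four",
    "author": "Bayram Kuliyev",
    "tour_name": "Jeopardy!. Ashgabat-1996",
    "tour_link": "https://db.chgk.info/tour/ash96sv",
    "episode": [16],
    "perturbation": "chegeka",
})
```

Keeping the optional fields defaulted means the same container can hold both train and test instances without special-casing.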

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

  • raw data: includes the original data with no additional sampling
  • episodes: data is split into evaluation episodes and includes several perturbations of the test set for robustness evaluation
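
To illustrate how the `episode` field might be used in the episodes configuration, the sketch below groups training instances by episode id. The helper name `group_by_episode` and the toy records are hypothetical and exist only for illustration. The sketch assumes each record is a dict whose `episode` value is a list of integers, as in the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_episode(records: Iterable[dict]) -> Dict[int, List[dict]]:
    """Map each episode id to the training records that belong to it."""
    groups: Dict[int, List[dict]] = defaultdict(list)
    for record in records:
        # A record may appear in several episodes, or in none.
        for episode_id in record.get("episode") or []:
            groups[episode_id].append(record)
    return dict(groups)


# Toy records (not real dataset rows), reduced to the relevant fields.
train = [
    {"question_id": 1, "episode": [16]},
    {"question_id": 2, "episode": [16, 17]},
    {"question_id": 3, "episode": []},
]
episodes = group_by_episode(train)
```

Records with an empty `episode` list simply do not appear in any group.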

Test Perturbations

Each training episode in the dataset corresponds to seven test variations: the original test data and six adversarial test sets, obtained by modifying the original test set with the following text perturbations:

  • ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
  • Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
  • EDAdelete: randomly deletes tokens in the text
  • EDAswap: randomly swaps tokens in the text
  • BackTranslation: generates variations of the context through back-translation (ru -> en -> ru)
  • AddSent: generates extra words or a sentence at the end of the question
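
The two token-level perturbations in the list above, EDAdelete and EDAswap, can be sketched in a few lines. This is an illustrative approximation, not the code used to build the dataset: the function names, whitespace tokenization, the deletion probability `p`, the swap count `n_swaps`, and the `seed` parameter are all assumptions.

```python
import random
from typing import Optional


def eda_delete(text: str, p: float = 0.1, seed: Optional[int] = None) -> str:
    """Randomly delete whitespace-separated tokens, each with probability p."""
    rng = random.Random(seed)
    tokens = text.split()
    if len(tokens) <= 1:
        return text  # nothing sensible to delete
    kept = [tok for tok in tokens if rng.random() >= p]
    if not kept:
        # Never return an empty string: keep one random token.
        kept = [rng.choice(tokens)]
    return " ".join(kept)


def eda_swap(text: str, n_swaps: int = 1, seed: Optional[int] = None) -> str:
    """Randomly swap the positions of two tokens, n_swaps times."""
    rng = random.Random(seed)
    tokens = text.split()
    if len(tokens) < 2:
        return text  # need at least two tokens to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


question = "THIS MAN replaced John Lennon when the Beatles got together for the last time."
deleted = eda_delete(question, p=0.2, seed=0)
swapped = eda_swap(question, n_swaps=1, seed=0)
```

Passing a fixed `seed` makes the perturbed text reproducible, which helps when comparing model outputs on the original and perturbed versions of the same question.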

Recent Benchmark Submissions

Task | Model | Paper | Date
Question Answering | Human benchmark | TAPE: Assessing Few-shot Russian Language … | 2022-10-23
Question Answering | RuGPT-3 Large | TAPE: Assessing Few-shot Russian Language … | 2022-10-23
Question Answering | RuGPT-3 Medium | TAPE: Assessing Few-shot Russian Language … | 2022-10-23
Question Answering | RuGPT-3 Small | TAPE: Assessing Few-shot Russian Language … | 2022-10-23
