List of Russian Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for lower-resource languages such as Russian can be hard to find, so we collected a list of Russian NLP datasets for machine learning (ML): a large curated base of training and testing data that lets you start your project right away. The list covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.




Found 26 Russian Datasets

Let’s get started!

Russian Commitment Bank (RCB) (SuperGlue)
Dataset is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment-canceling operator (question, modal, negation, or antecedent of a conditional).
Choice of Plausible Alternatives for Russian language (PARus) (SuperGlue)
Dataset is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise.
Russian Multi-Sentence Reading Comprehension (MuSeRC) (SuperGlue)
Dataset for question answering in which questions can only be answered by taking into account information from multiple sentences. It contains approximately 6,000 questions for more than 800 paragraphs across 5 different domains, namely: 1) elementary school texts, 2) news, 3) fiction stories, 4) fairy tales, 5) brief annotations of TV series and books.
Textual Entailment Recognition for Russian (TERRa) (SuperGlue)
Given two text fragments, the task is to recognize whether the meaning of one text is entailed by (can be inferred from) the other text.
Words in Context (RUSSe) (SuperGlue)
Given two sentences and a polysemous word that occurs in both of them, the task is to determine whether the word is used in the same sense in both sentences.
The Winograd Schema Challenge Russian (RWSD) (SuperGlue)
Dataset is constructed as a translation of the English Winograd Schema Challenge. Each task instance is a pair of sentences that differ in only one or two words and contain an ambiguity that is resolved in opposite ways in the two sentences; resolving it requires world knowledge and reasoning.
DaNetQA (SuperGlue)
Dataset is a question-answering corpus comprising natural yes/no questions.
Russian Reading Comprehension with Commonsense reasoning (RuCoS) (SuperGlue)
Dataset consists of passages and cloze-style queries automatically generated from Russian news articles, namely Lenta and Deutsche Welle. Dataset is used for commonsense reasoning and is modeled after the ReCoRD dataset.
CC100-Russian
This dataset is one of the 100 monolingual corpora built by processing the January-December 2018 Common Crawl snapshots with the CC-Net repository. The size of this corpus is 46 GB.
ParaPhraser Plus
Dataset contains 7,227 pairs of sentences, which are classified by humans into three classes: 2,582 non-paraphrases, 2,957 near-paraphrases, and 1,688 precise paraphrases.
RuBQ
Dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels.
SberQuAD
Dataset consists of question-answer pairs and is modeled after SQuAD.
CC100
This dataset is one of the 100 monolingual corpora built by processing the January-December 2018 Common Crawl snapshots with the CC-Net repository. The size of this corpus is 46 GB.
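
As a quick sanity check on the ParaPhraser Plus entry, the sketch below re-adds the three class counts quoted in that entry and computes each class's share of the data. The label strings and function names are illustrative choices, not part of the dataset's own format, and the majority-class figure is only a rough reference point that assumes accuracy is the metric of interest.

```python
# Class counts copied from the ParaPhraser Plus entry in the list above.
# The label strings are illustrative, not the dataset's own field values.
CLASS_COUNTS = {
    "non-paraphrase": 2582,
    "near-paraphrase": 2957,
    "precise-paraphrase": 1688,
}
REPORTED_TOTAL = 7227  # total number of sentence pairs quoted in the entry


def class_shares(counts):
    """Return each class's share of all sentence pairs."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_baseline(counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    label = max(counts, key=counts.get)
    return label, counts[label] / sum(counts.values())


if __name__ == "__main__":
    # The three class counts should add up to the reported total.
    assert sum(CLASS_COUNTS.values()) == REPORTED_TOTAL
    for label, share in class_shares(CLASS_COUNTS).items():
        print(f"{label}: {share:.1%}")
    label, accuracy = majority_baseline(CLASS_COUNTS)
    print(f"majority baseline: always predict {label!r} -> {accuracy:.1%}")
```

Run as a script, it prints the class shares and the accuracy of always predicting the most frequent class (near-paraphrases, roughly 41% of the pairs), which gives a floor to compare a trained classifier against.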
