List of Multilingual Datasets for Machine Learning Projects

High-quality datasets are key to good performance in natural language processing (NLP) projects. Although low-resource and multilingual language datasets can be hard to find, there are plenty of good ones to start your machine learning (ML) project with right now. To help, we collected a curated list of multilingual NLP datasets for machine learning: a large base of training and testing data covering a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Custom fine-tuning with multilingual datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models
➡️  Try for free


Found 172 Multilingual Datasets

Let’s get started!

WikiNER
Dataset contains 7,200 manually-labelled Wikipedia articles across nine languages: English, German, French, Polish, Italian, Spanish, Dutch, Portuguese and Russian.
XQuaD-R
Dataset is the retrieval version of XQuAD. Like XQuAD, XQuAD-R is an 11-way parallel dataset, where each question appears in 11 different languages and has 11 parallel correct answers across the languages.
XL-WiC
Dataset extends the WiC dataset containing 80K instances to 12 new languages: Bulgarian, Chinese, Croatian, Danish, Dutch, Estonian, Farsi, French, German, Italian, Japanese and Korean.
EXAMS
A benchmark dataset for cross-lingual and multi-lingual question answering for high school examinations. We collected more than 24,000 high quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences. Langs: Albanian, Arabic, Bulgarian, Croatian, French, German, Hungarian, Italian, Lithuanian, Macedonian, Polish, Portuguese, Serbian, Spanish, Turkish, and Vietnamese.
XED
Dataset consists of emotion-annotated movie subtitles from OPUS, labelled with Plutchik's 8 core emotions. The data is multilabel. The original annotations were produced mainly for English and Finnish, with the rest created using annotation projection to aligned subtitles in 41 additional languages; the 31 languages with more than 950 annotated subtitle lines each were included in the final dataset.
CLIRMatrix
Dataset is a collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. It comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, 49 million unique queries and 34 billion (query, document, label) triplets were mined.
Mewsli-9 
Dataset consists of entity mentions linked to WikiData, extracted from WikiNews articles. It covers 9 diverse languages, 5 language families and 6 writing systems. It features many WikiData entities that do not appear in English Wikipedia, thereby incentivizing research into multilingual entity linking against WikiData at-large. Langs: Japanese, German, Spanish, Arabic, Serbian, Turkish, Persian, Tamil & English.
XOR-TyDi QA
Dataset is a multi-lingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations that are retrieved from multilingual document collections. There are three sub-tasks: XOR-Retrieve, XOR-EnglishSpan, and XOR-Full.
RELX & RELX-Distant
Two datasets for cross-lingual relation classification are included: RELX and RELX-Distant. RELX contains 502 parallel sentences per language (5 languages in total), annotated with 18 directed relations plus no_relation, for a total of 37 categories. RELX-Distant was extracted from Wikipedia and Wikidata.
MTOP
Dataset contains 100k annotated utterances in 6 languages (English, German, French, Spanish, Hindi, and Thai), with a mix of simple and compositional nested queries across 11 domains, 117 intents, and 78 slots.
CC Net
A cleaned and deduplicated version of the Common Crawl corpus. The pipeline preserves the structure of documents and filters the data based on their distance to Wikipedia.
News Commentary Parallel Corpus
Parallel corpora of political and economic commentary crawled from the website Project Syndicate.
CoVoST
Dataset is a multilingual speech-to-text translation corpus covering translations from 21 languages into English and from English into 15 languages. The overall speech duration is 2,880 hours. The total number of speakers is 78K.
Multilingual Knowledge Questions & Answers (MKQA)
Dataset is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total).
VoxClamantis
Dataset contains phoneme-level alignments for more than 600 languages, high-resource alignments for ~50 languages, and phonetic measures for all vowels and sibilants. Consists of 690 audio readings of the New Testament of the Bible.
JW300
Dataset is a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average.
Tanzil
Dataset is a collection of Quran translations in 42 languages.
Tatoeba
Dataset is a collection of sentences and translations.
VoxForge
Dataset consisting of speech audio clips submitted by the community involving several different languages. Dataset is constantly updated.
COVID-19 Twitter Chatter Dataset
Dataset contains over 152 million tweets related to COVID-19 chatter, generated from January 1st, 2020 to the present and growing daily.
ConceptNet
A knowledge graph that connects words and phrases of natural language (terms) with labeled, weighted edges (assertions).
MLSUM
Dataset was collected from online newspapers; it contains 1.5M+ article/summary pairs in 5 languages: French, German, Spanish, Russian, and Turkish.
Cross-lingual Choice of Plausible Alternatives (XCOPA)
Dataset is the translation and reannotation of the English COPA and covers 11 languages: Estonian, Haitian Creole, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese & Mandarin Chinese. The dataset requires both command of world knowledge and the ability to generalise to new languages.
ParaCrawl Corpus
Multiple parallel datasets of European languages for machine translation.
Leipzig Corpora Collection
Web-crawled news corpora in 252 languages.
The EUR-Lex Dataset
Dataset is a collection of documents about European Union law. It contains many different types of documents, including treaties, legislation, case-law and legislative proposals, which are indexed with almost 4,000 labels.
Gutenberg Dialogue
A dataset created by extracting dialogue from the Gutenberg book collection, comprising ~60,000 books. Currently it supports English, German, Dutch, Spanish, Portuguese, Italian, and Hungarian.
BSNLP-2019
Dataset used to classify named entities in web documents in Slavic languages, their lemmatization, and cross-language matching. Dataset covers 4 languages: Bulgarian, Czech, Polish, and Russian.
Webis-CLS-10
The Cross-Lingual Sentiment (CLS) dataset comprises about 800,000 Amazon product reviews in 4 languages: English, German, French, and Japanese.
The Cross-lingual Natural Language Inference corpus (XNLI)
Dataset contains a collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu.
WikiAnn
Dataset with NER annotations for PER, ORG and LOC. It has been constructed using the linked entities in Wikipedia pages for 282 different languages.
X-Stance
Dataset contains more than 150 political questions, and 67k comments written by candidates on those questions. The questions are available in German, French, Italian and English.
Multilingual Corpus of Sentence-Aligned Spoken Utterances (MaSS)
Dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). Languages: Basque, English, Finnish, French, Hungarian, Romanian, Russian, Spanish.
MultiLing Pilot 2011 Dataset
Dataset is derived from publicly available WikiNews English texts and translated into 7 languages: Arabic, Czech, English, French, Greek, Hebrew, Hindi.
Train-O-Matic Large
Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses.
Train-O-Matic Small
Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses.
OneSeC Small
Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses.
MuST-C
Dataset is a speech translation corpus containing 385 hours of TED talks for speech translation from English into several languages: Dutch, French, German, Italian, Portuguese, Romanian, Russian, and Spanish. Requires filling out a request form.
Europarl-ST
Dataset contains paired audio-text samples for speech translation, constructed using the debates carried out in the European Parliament in the period between 2008 and 2012. Contains 6 European languages: German, English, Spanish, French, Italian and Portuguese.
XQuAD
Dataset consists of a subset of 240 context paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1 with their translations in 10 languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi.
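Since XQuAD is derived from SQuAD v1.1, its predictions are typically scored with SQuAD-style exact match. A minimal sketch (the function names are ours; the article stripping follows the original SQuAD v1.1 evaluation script and is English-specific, so it is usually dropped for the other languages):

```python
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold_answers):
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # 1.0
```

The same normalization is usually reused for token-level F1, the other standard SQuAD metric.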
Bible Corpus
A parallel corpus created from translations of the Bible containing 102 languages.
Bianet
Dataset is a parallel news corpus with 3,214 Turkish articles with their sentence-aligned Kurdish or English translations from the Bianet online newspaper. Requires a request submission for dataset.
ECB Corpus
Website and documentation from the European Central Bank. Contains 19 languages.
EMEA
A parallel corpus made out of PDF documents from the European Medicines Agency. Contains 22 languages.
Eubookshop
Corpus of documents from the EU bookshop. Contains 48 languages.
Code-Mixed-Dialog
A goal-oriented dialog dataset containing code-mixed conversations. Specifically, text from the DSTC2 restaurant reservation dataset was used to create code-mixed versions of it in Hindi-English, Bengali-English, Gujarati-English and Tamil-English.
A Novel Approach to a Semantically-Aware Representation of Items (NASARI)
Dataset contains semantic vector representations for BabelNet synsets and Wikipedia pages in several languages: English, Spanish, French, German and Italian. Three vector types are currently available: lexical, unified, and embedded.
CCMatrix
4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public dataset.
Densely Annotated Wikipedia Texts (DAWT)
Dataset contains a total of 13.6M articles across several languages: English, Spanish, Italian, German, French and Arabic. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of entity.
Europeana Newspapers
Named Entity Recognition corpora for Dutch, French, German languages from Europeana Newspapers. Data is encoded in the IOB format.
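In the IOB format used by these corpora, each token carries a B- (begin), I- (inside), or O (outside) tag. As a rough illustration (the helper function and the sample sentence below are ours, not from the corpus), the tags can be decoded into entity spans like this:

```python
def iob_to_spans(tokens, tags):
    """Collapse IOB tags into a list of (entity_type, entity_text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and current_type != tag[2:]):
            # A B- tag (or an I- tag with a mismatched type) starts a new span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # An O tag closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, tags))  # [('PER', 'Angela Merkel'), ('LOC', 'Paris')]
```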
The NewsReader MEANTIME Corpus
480 news articles: 120 English Wikinews articles on four topics (i.e. Airbus and Boeing, Apple Inc., Stock market, and General Motors, Chrysler and Ford) and their translations in Spanish, Italian, and Dutch. Annotated with entities, events, temporal information, semantic roles, and event/entity coreference.
WikiMatrix
Dataset contains 135 million parallel sentences for 1,620 different language pairs in 85 different languages.
Books Corpus
Dataset contains a collection of copyright-free books. The corpus consists of 16 languages, 0.91M sentence fragments, and 19.50M tokens.
Global Voices Parallel Corpus
Dataset contains news articles in multiple languages from the website Global Voices.
IWSLT'15 English-Vietnamese 
Parallel corpus used for English-Vietnamese machine translation.
MultiLingual Question Answering (MLQA)
Dataset for evaluating cross-lingual question answering performance. ~12K QA instances in English and 5K in each other language, in SQuAD format, across seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese.
Parallel Meaning Bank
Dataset contains sentences and texts in raw and tokenised format, syntactic analysis, word senses, thematic roles, reference resolution, and formal meaning representations. The annotated parallel corpus includes English, German, Dutch and Italian.
TyDi QA
TyDi QA includes question-answer pairs from 11 languages: Arabic, Bengali, English, Finnish, Indonesian, Kiswahili, Russian, Japanese, Korean, Thai, and Telugu.
United Nations Parallel Corpus
Parallel corpus consisting of manually translated UN documents from 25 years (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.
Open Super-Large Crawled Almanach Corpus (OSCAR)
Multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 166 different languages are available.
OpenSubtitles
Dataset of multi-lingual dialogs from movie scripts. Includes 62 languages.
Web Inventory of Transcribed and Translated Talks (WIT3)
Dataset contains a collection of transcribed and translated talks. The core of the dataset is the TED Talks corpus. As of 2016, it holds 109 languages.
WikiReading
The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. Includes English, Russian and Turkish.
OntoNotes 5.0
Dataset contains various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
Paraphrase Adversaries from Word Scrambling (PAWS-X)
Dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
AudioSet
Dataset consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
Common Voice
Dataset containing audio in 29 languages and 2,454 recorded hours.
CommonCrawl
Dataset contains data from 25 billion web pages.
DBpedia
The English version of the DBpedia knowledge base currently describes 6.6M entities of which 4.9M have abstracts, 1.9M have geo coordinates and 1.7M depictions. In total, 5.5M resources are classified in a consistent ontology.
DSL Corpus Collection (DSLCC)
Dataset contains short excerpts of journalistic texts in similar languages and dialects.
European Parliament Proceedings (Europarl)
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages.
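Europarl releases are distributed as pairs of plain-text files with one sentence per line, where line i of the source file aligns with line i of the target file. A minimal sketch of pairing the lines up (the helper function is ours, and the two sample sentences are illustrative):

```python
def align_pairs(src_lines, tgt_lines):
    """Pair up aligned source/target lines, skipping empty alignments."""
    pairs = []
    for s, t in zip(src_lines, tgt_lines):
        s, t = s.strip(), t.strip()
        if s and t:  # parallel files may contain blank alignment lines
            pairs.append((s, t))
    return pairs

en = ["Resumption of the session\n", "\n", "Please rise, then.\n"]
fr = ["Reprise de la session\n", "\n", "Levez-vous donc.\n"]
print(align_pairs(en, fr))
```

In practice you would pass in two open file handles (e.g. the `.en` and `.fr` files of a language pair) instead of in-memory lists.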
Google Books N-grams
N-grams from a very large corpus of books.
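As a refresher on what this resource contains: an n-gram is a contiguous sequence of n tokens, and the Google Books data is essentially a giant table of n-gram frequency counts per year. A toy sketch of counting n-grams (the function name is ours):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "to be or not to be".split()
print(ngram_counts(tokens, 2).most_common(1))  # [(('to', 'be'), 2)]
```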
Gutenberg Book Corpus
Dataset contains 60,000 eBooks.
Microsoft Speech Language Translation Corpus (MSLT)
Dataset contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese. It includes audio data, transcripts, and translations; and allows end-to-end testing of spoken language translation systems on real-world data.
One Week of Global News Feeds
Dataset contains most of the new news content published online during one-week periods in 2017 and 2018.
The Winograd Schema Challenge
Dataset to determine the correct referent of the pronoun from among the provided choices.
VoxCeleb
An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
WMT 14 English-German
Sentence pairs for translation.
WMT 15 English-Czech
Sentence pairs for translation.
WMT 19 Multiple Datasets
Multiple text corpora in multiple languages.
Worldwide News - Aggregate of 20K Feeds
One week snapshot of all online headlines in 20+ languages.

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using customized language models built with your data and expertise.