List of Translation Datasets for Machine Learning Projects
High-quality datasets are the key to good performance in natural language processing (NLP) projects. We collected a list of NLP datasets for the translation task to help you get started with your machine learning projects. Below you will find a large, curated training base for translation.
What is the Translation task?
Language translation, or machine translation, is the use of computer software to translate text from a document in one language to text in another language.
Custom fine-tune with Translation datasets
Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models.
Found 94 Translation Datasets
Let’s get started!
PheMT
Dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant.
Business Scene Dialogue (BSD)
Dataset contains 955 scenarios and 30,000 parallel sentences in English-Japanese.
News Commentary Parallel Corpus
Dataset consists of parallel corpora of political and economic commentary crawled from the Project Syndicate website.
CoVoST
Dataset is a multilingual speech-to-text translation corpus covering translations from 21 languages into English and from English into 15 languages. The overall speech duration is 2,880 hours. The total number of speakers is 78K.
CodeXGLUE: CodeTrans
Given a piece of Java (or C#) code, the task is to translate it into the C# (or Java) version. Models are evaluated by BLEU score, accuracy (exact match), and CodeBLEU score.
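The exact-match accuracy used for CodeTrans can be illustrated with a minimal Python sketch. The function name `exact_match_accuracy` and the whitespace normalization step are assumptions made for illustration, not the benchmark's official scorer; the sketch only counts how many predicted translations are identical to their references.

```python
def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions identical to their reference.

    Whitespace is collapsed before comparison (an assumption of this
    sketch), so differences in spacing or line breaks are ignored.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(code):
        # Collapse every run of whitespace into a single space.
        return " ".join(code.split())

    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


if __name__ == "__main__":
    # Hypothetical model outputs and reference translations.
    preds = ["int x = 1;", "return  y;", "foo();"]
    refs = ["int x = 1;", "return y;", "bar();"]
    print(exact_match_accuracy(preds, refs))
```

Exact match is a strict metric: a translation that is functionally correct but differs in a single identifier scores zero, which is why it is reported alongside BLEU and CodeBLEU.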
NEJM-enzh
Dataset is an English-Chinese parallel corpus, consisting of about 100,000 sentence pairs and 3,000,000 tokens on each side, from the New England Journal of Medicine (NEJM).
JW300
Dataset is a parallel corpus of over 300 languages, with around 100 thousand parallel sentences per language pair on average.
Tanzil
Dataset is a collection of Quran translations in 42 languages.
Tatoeba
Dataset is a collection of sentences and translations.
Yoruba Text
Multiple datasets scraped together for the Yoruba language.
Igbo Text
Dataset is a parallel dataset for the Urhobo language.
Urhobo Text
Dataset is a parallel dataset containing 10.3M tokens.
ParaCrawl Corpus
Multiple parallel datasets of European languages for machine translation.
DiaBLa
Parallel dataset of spontaneous, written, bilingual dialogues for the evaluation of Machine Translation, annotated for human judgments of translation quality.
Multi30k
Dataset of images paired with sentences in English and German. This dataset extends the Flickr30K dataset.
ParCorFull
A parallel corpus annotated for the task of translating coreference across languages.
WAT 2019 Hindi-English
Dataset is for multimodal English-to-Hindi translation. The input is an image, a rectangular region in the image, and an English caption; the output is a caption in Hindi.
MuST-C
Dataset is a speech translation corpus containing 385 hours of TED talks for speech translation from English into several languages: Dutch, French, German, Italian, Portuguese, Romanian, Russian, and Spanish. Access requires filling out a request form.
How2
Dataset of instructional videos covering a wide variety of topics across video clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. Of these, 300 hours were translated into Portuguese subtitles.
LibriVoxDeEn
Dataset contains sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences.
Europarl-ST
Dataset contains paired audio-text samples for speech translation, constructed from the debates held in the European Parliament between 2008 and 2012. Contains six European languages: German, English, Spanish, French, Italian, and Portuguese.
Translation-Augmented-LibriSpeech-Corpus (Libri-Trans)
Dataset is an augmentation of LibriSpeech ASR and contains English utterances (from audiobooks) automatically aligned with French text. It offers ~236h of speech aligned to translated text.
Bible Corpus
A parallel corpus created from translations of the Bible, covering 102 languages.
Bianet
Dataset is a parallel news corpus of 3,214 Turkish articles with their sentence-aligned Kurdish or English translations from the Bianet online newspaper. Access to the dataset requires submitting a request.
CAPES
A parallel corpus of theses and dissertation abstracts in Portuguese and English from CAPES.
DOGC
A collection of documents from the official journal of the Catalan Government in Catalan and Spanish.
ECB Corpus
Website and documentation from the European Central Bank. Contains 19 languages.
EMEA
A parallel corpus made out of PDF documents from the European Medicines Agency. Contains 22 languages.
Eubookshop
Corpus of documents from the EU bookshop. Contains 48 languages.
Finlex
Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish.
Fiskmö
Dataset is a parallel corpus of the Finnish and Swedish languages.
CCMatrix
4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public dataset.
Indic Languages Multilingual Parallel Corpus
Dataset contains several languages: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese, Urdu and English. The corpus has been collected from OPUS and belongs to the spoken language (OpenSubtitles) domain.
WikiMatrix
Dataset contains 135 million parallel sentences for 1,620 different language pairs in 85 different languages.
Books Corpus
Dataset contains a collection of copyright-free books. The corpus consists of 16 languages, 0.91M sentence fragments, and 19.50M tokens.
Global Voices Parallel Corpus
Dataset contains news articles from the web site Global Voices in multiple languages.
IWSLT'15 English-Vietnamese
Parallel corpus used for English-Vietnamese machine translation.
United Nations Parallel Corpus
The corpus consists of manually translated UN documents spanning 25 years (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.
Web Inventory of Transcribed and Translated Talks (WIT3)
Dataset contains a collection of transcribed and translated talks. The core of the dataset is from the TED Talks corpus. As of 2016, it holds 109 languages.
IIT Bombay English-Hindi Corpus
Dataset contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources.
European Parliament Proceedings (Europarl)
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages.
IWSLT 15 English-Vietnamese
Sentence pairs for translation.
Microsoft Speech Language Translation Corpus (MSLT)
Dataset contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese. It includes audio data, transcripts, and translations, and allows end-to-end testing of spoken language translation systems on real-world data.
WMT 14 English-German
Sentence pairs for translation.
WMT 15 English-Czech
Sentence pairs for translation.
WMT 19 Multiple Datasets
Multiple text corpora in multiple languages.
Worldwide News - Aggregate of 20K Feeds
One-week snapshot of all online headlines in 20+ languages.