List of Speech Datasets for Machine Learning Projects

High-quality datasets are key to good performance in natural language processing (NLP) projects. We have collected a list of NLP datasets for speech tasks to help you get your machine learning projects started. Below you will find a large curated training base for speech.

What is a speech task?

Speech recognition is a natural language processing (NLP) task in which a system listens to and comprehends human speech. It interprets the spoken words of a natural language, such as English, as computer commands or data input.


Custom fine-tuning with speech datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models.
➡️  Try for free


Found 41 Speech Datasets

Let’s get started!

POSP (IndoNLU)
Dataset is collected from Indonesian news websites. The dataset consists of around 8,000 sentences with 26 POS tags.
BaPOS (IndoNLU)
Dataset contains about 1,000 sentences, collected from the PAN Localization Project. In this dataset, each word is tagged by one of 23 POS tag classes.
Spoken Language Understanding Resource Package (SLURP)
Dataset is a collection of ~72K audio recordings of single turn user interactions with a home assistant, annotated with three levels of semantics: Scenario, Action and Entities, including over 18 different scenarios, with 46 defined actions and 55 different entity types.
Tunisian Arabish Corpus (TArC)
Dataset has been extracted from social media and amounts to 43,313 tokens. The classification task consists of categorizing the text at the token level into three classes: arabizi, foreign, and emotag.
Bengali Hate Speech
Dataset contains Bengali text classified into 5 categories: personal hate, political hate, religious hate, geopolitical hate, & gender abusive hate.
Offensive Language Identification Dataset (OLID)
Dataset contains a collection of 14,200 annotated English tweets using an annotation model that encompasses three levels: offensive language detection, categorization of offensive language, and offensive language target identification.
CoVoST
Dataset is a multilingual speech-to-text translation corpus covering translations from 21 languages into English and from English into 15 languages. The overall speech duration is 2,880 hours. The total number of speakers is 78K.
CMU_ARCTIC
Dataset contains 1,150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
L2-ARCTIC
Dataset includes recordings from twenty-four (24) non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, Arabic and Vietnamese, each L1 containing recordings from two male and two female speakers. Each speaker recorded approximately one hour of read speech from CMU’s ARCTIC prompts.
TalkDown
Dataset used for classifying condescending acts in context. Dataset was extracted from Reddit COMMENT and REPLY pairs in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending.
Open-Domain Spoken Question Answering Dataset (ODSQA)
Dataset contains questions in both text and spoken forms, a multi-sentence spoken-form document, and a word-span answer drawn from the document.
Hateful Memes
Dataset is used to detect hateful memes. In total, the dataset contains 10,000 memes of five different types: multimodal hate, where benign confounders were found for both modalities; unimodal hate, where one or both modalities were already hateful on their own; benign image confounders; benign text confounders; and random not-hateful examples.
LibriMix
Dataset is used for speech source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it.
Korean Hate Speech Dataset
Dataset contains ~9.4K manually labeled entertainment news comments for identifying Korean toxic speech.
FT Speech
Dataset contains recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers.
CSTR VCTK Corpus
Dataset contains speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent.
WSJ0 Hipster Ambient Mixtures (WHAM!)
Dataset consists of two speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area.
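Mixtures like these are typically built by scaling a noise signal so that the speech sits at a target signal-to-noise ratio before summing. A minimal sketch of that idea (simplified relative to the actual WHAM! recipe, which also handles resampling, truncation, and level normalization; the signals here are random stand-ins):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech-to-noise ratio equals `snr_db`, then sum."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise to the desired SNR relative to the speech.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of stand-in "speech" at 16 kHz
noise = rng.standard_normal(16000)   # 1 s of stand-in ambient noise
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

The separation model is then trained to recover the clean sources from `mixture`.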
LibriTTS
Dataset is a multi-speaker corpus of approximately 585 hours of read English speech at a 24 kHz sampling rate.
LJSpeech
Dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
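LJSpeech pairs each clip with its transcript in a pipe-delimited metadata.csv (clip id, raw transcription, normalized transcription). A minimal loader sketch, using a short in-memory sample as a stand-in for the real file (the transcript text here is abbreviated for illustration):

```python
import csv
import io

# Stand-in for LJSpeech's metadata.csv: <clip id>|<raw text>|<normalized text>
sample = (
    "LJ001-0001|Printing, in the only sense|printing, in the only sense\n"
    "LJ001-0002|in being comparatively modern.|in being comparatively modern.\n"
)

def load_ljspeech_metadata(fileobj):
    """Yield (clip_id, normalized_text) pairs from an LJSpeech metadata file."""
    # QUOTE_NONE: transcripts may contain quote characters that are not quoting.
    reader = csv.reader(fileobj, delimiter="|", quoting=csv.QUOTE_NONE)
    for clip_id, _raw, normalized in reader:
        yield clip_id, normalized

pairs = list(load_ljspeech_metadata(io.StringIO(sample)))
```

Each clip id maps to a WAV file of the same name in the corpus's wavs/ directory.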
VoxForge
Dataset consisting of speech audio clips submitted by the community involving several different languages. Dataset is constantly updated.
KALIMAT Multipurpose Arabic Corpus
Dataset contains 20,291 Arabic articles collected from the Omani newspaper Alwatan, along with extractive single-document and multi-document system summaries and named-entity-annotated articles. The data covers 6 categories: culture, economy, local news, international news, religion, and sports.
Libri-Light
Dataset contains 60K hours of unlabelled speech from audiobooks in English and a small labelled data set (10h, 1h, and 10 min).
Genia
Dataset contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms "human", "blood cells", and "transcription factors". The corpus has been annotated for part-of-speech, constituency syntax, terms, events, relations, and coreference.
Korean Single Speaker Dataset (KSS)
Dataset consists of audio files recorded by a professional female voice actress and their aligned text extracted from books.
Arabic Speech Corpus
Dataset was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized from this corpus produces a high-quality, natural voice.
Multilingual Corpus of Sentence-Aligned Spoken Utterances (MaSS)
Dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). Languages: Basque, English, Finnish, French, Hungarian, Romanian, Russian, and Spanish.
MuST-C
Dataset is a speech translation corpus containing 385 hours of speech from TED Talks for translation from English into several languages: Dutch, French, German, Italian, Portuguese, Romanian, Russian, & Spanish. Access requires filling out a request form.
How2
Dataset of instructional videos covering a wide variety of topics across video clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. About 300 hours have also been translated into Portuguese subtitles.
LibriVoxDeEn
Dataset contains sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences.
Europarl-ST
Dataset contains paired audio-text samples for speech translation, constructed from the debates held in the European Parliament between 2008 and 2012. Contains 6 European languages: German, English, Spanish, French, Italian, and Portuguese.
Translation-Augmented-LibriSpeech-Corpus (Libri-Trans)
Dataset is an augmentation of LibriSpeech ASR and contains English utterances (from audiobooks) automatically aligned with French text. It offers ~236h of speech aligned to translated text.
Conference on Computational Natural Language Learning (CoNLL 2003)
Dataset contains news articles whose text is segmented into 4 columns: the first item is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag.
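The four-column layout above (one token per line, blank lines separating sentences) is simple to parse; a minimal reader sketch using the well-known opening sentence of the corpus as sample input:

```python
# CoNLL-2003-style data: one token per line with four space-separated
# columns (word, POS tag, chunk tag, NER tag); blank lines end sentences.
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def read_conll(text):
    """Return a list of sentences, each a list of (word, pos, chunk, ner)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:  # flush the final sentence if the text lacks a trailing blank
        sentences.append(current)
    return sentences

sentences = read_conll(sample)
```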
Argentinian Spanish [es-ar] Speech Multi-Speaker Dataset
Speech dataset containing about 5,900 high-quality transcribed audio recordings of Argentinian Spanish [es-ar] sentences, recorded by volunteers.
AudioSet
Dataset consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
Common Voice
Dataset containing audio in 29 languages and 2,454 recorded hours.
LibriSpeech ASR
Large-scale (1,000 hours) corpus of read English speech.
Microsoft Information-Seeking Conversation (MISC) dataset
Dataset contains recordings of information-seeking conversations between human “seekers” and “intermediaries”. It includes audio and video signals; transcripts of conversation; affectual and physiological signals; recordings of search and other computer use; and post-task surveys on emotion, success, and effort.
Microsoft Speech Corpus
Dataset contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages.
Microsoft Speech Language Translation Corpus (MSLT)
Dataset contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese. It includes audio data, transcripts, and translations; and allows end-to-end testing of spoken language translation systems on real-world data.
Voices Obscured in Complex Environmental Settings (VOiCES)
Dataset contains a total of 15 hours (3,903 audio files) of male and female read speech.
VoxCeleb
An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using customized language models built with your data and expertise.