Curated NLP Database

List of 1000+ Natural Language Processing Datasets

The collection covers tasks from classification to question answering, and languages from English and Portuguese to Arabic. We hope you find this library useful in your development work.

| Author | Dataset | Description | Instances | Language | Task | Paper | Download |
|---|---|---|---|---|---|---|---|
| Ramos et al. | B5 Corpus | A collection of Facebook posts with information about their Brazilian authors, such as gender, age, personality score (based on the B5 test), education level, political position, religion, and others. | 1,012 | Portuguese | Text Classification | Paper | Link |
| Weller et al. | Zero Shot Learning from Task Descriptions (ZEST) | Dataset for zero-shot prediction, formatted similarly to reading comprehension datasets: the authors formulate task descriptions as questions and pair them with paragraphs of text. | 25,026 | English | Zero Shot Prediction | Paper | Link |
| Farahani | WikiSummary | A summarization dataset extracted from Wikipedia. | 56,363 | Persian | Summarization | Paper | Link |
| Upadhayay et al. | Sentimental LIAR | A modified and extended version of the original LIAR dataset. It was converted to a binary-label dataset and then extended with sentiments derived using the Google NLP API. | n/a | English | Classification, Fake News Detection | Paper | Link |
| Malo et al. | FinancialPhraseBank | Sentiment labels for financial news headlines from the perspective of a retail investor. | 4,837 | English | Sentiment Analysis | Paper | Link |
| Tandon et al. | WebChild | Triples that connect nouns with adjectives via fine-grained relations such as hasShape, hasTaste, and evokesEmotion. The arguments of these assertions (nouns and adjectives) are disambiguated by mapping them onto their proper WordNet senses. | 4M triples | English | Commonsense, Knowledge Base | Paper | Link |
| Ilmania et al. | CASA (IndoNLU) | An aspect-based sentiment analysis dataset of around a thousand car reviews collected from multiple Indonesian online automobile platforms. The task is defined as multi-label classification, where each label represents the sentiment for a single aspect with three possible values: positive, negative, and neutral. | 1,08 | Indonesian | Classification, Sentiment Analysis | Paper | Link |
| Azhar et al. | HoASA (IndoNLU) | An aspect-based sentiment analysis dataset of hotel reviews collected from the hotel aggregator platform AiryRooms. The dataset covers ten aspects of hotel quality. Each sentiment label has four possible classes: positive, negative, neutral, and positive-negative. | 2,854 | Indonesian | Classification, Sentiment Analysis | Paper | Link |
| Setya and Mahendra et al. | The Wiki Revision Edits Textual Entailment (WReTE) (IndoNLU) | 450 sentence pairs constructed from Wikipedia revision history, each annotated with a binary semantic relation. A pair is labeled as entailed when the meaning of the second sentence can be derived from the first, and not entailed otherwise. | 450 | Indonesian | Natural Language Inference (NLI) | Paper | Link |
| Hoesen and Purwarianti et al. | POSP (IndoNLU) | Collected from Indonesian news websites. The dataset consists of around 8,000 sentences with 26 POS tags. | 8,4 | Indonesian | Part-of-Speech (POS) | Paper | Link |
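
The aspect-based sentiment entries above (CASA, HoASA) describe a multi-label setup in which every aspect gets its own sentiment label. A minimal Python sketch of that label layout follows. The aspect names, the `encode_labels` and `decode_labels` helpers, and the choice to treat an unmentioned aspect as neutral are assumptions made for illustration; they are not taken from the datasets or their papers.

```python
from typing import Dict, List

# Hypothetical aspect names, chosen only for illustration.
ASPECTS: List[str] = ["fuel", "machine", "price"]
# The three sentiment values named in the CASA description.
SENTIMENTS: List[str] = ["positive", "negative", "neutral"]


def encode_labels(review_labels: Dict[str, str]) -> List[int]:
    """Turn {aspect: sentiment} into one class index per aspect, in ASPECTS order."""
    encoded = []
    for aspect in ASPECTS:
        # Assumption: an aspect missing from the annotation defaults to neutral.
        sentiment = review_labels.get(aspect, "neutral")
        encoded.append(SENTIMENTS.index(sentiment))
    return encoded


def decode_labels(encoded: List[int]) -> Dict[str, str]:
    """Map a vector of class indices back to {aspect: sentiment}."""
    return {aspect: SENTIMENTS[i] for aspect, i in zip(ASPECTS, encoded)}


example = {"fuel": "positive", "price": "negative"}
vector = encode_labels(example)
print(vector)                 # [0, 2, 1]
print(decode_labels(vector))
```

A dataset with a fourth class, such as the positive-negative value listed for HoASA, would only need one more entry in `SENTIMENTS` under this layout.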

