List of English Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. We have collected a list of English NLP datasets for machine learning: a large curated base of training and testing data covering a wide gamut of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Custom fine-tune with English datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models
➡️ Try for free


Found 1174 English Datasets

Let's get started!

Zero Shot Learning from Task Descriptions (ZEST)
Dataset used for zero-shot prediction that is formatted similarly to reading comprehension datasets, where the authors formulate task descriptions as questions and pair them with paragraphs of text.
Sentimental LIAR
Sentimental LIAR dataset is a modified and further extended version of the original LIAR dataset. It was modified to be a binary-label dataset that was then extended by adding sentiments derived using the Google NLP API.
FinancialPhraseBank
Dataset contains the sentiments for financial news headlines from the perspective of a retail investor.
WebChild
Dataset contains triples that connect nouns with adjectives via fine-grained relations like hasShape, hasTaste, evokesEmotion, etc. The arguments of these assertions, nouns and adjectives, are disambiguated by mapping them onto their proper WordNet senses.
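For illustration, a WebChild-style assertion can be thought of as a (noun, relation, adjective) triple. A minimal Python sketch, where the relation names come from the description above but the arguments are invented:

    # Hypothetical WebChild-style triples: (noun, relation, adjective).
    # Relation names are from the dataset description; arguments are invented.
    triples = [
        ("lemon", "hasTaste", "sour"),
        ("ball", "hasShape", "round"),
        ("sunset", "evokesEmotion", "calm"),
    ]
    for noun, relation, adjective in triples:
        print(f"{noun} --{relation}--> {adjective}")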
HOVER
Dataset is an open-domain, many-hop fact extraction and claim verification dataset built upon the Wikipedia corpus. The original 2-hop claims are adapted from question-answer pairs from HotpotQA.
Stack Overflow Question-Code Pairs (StaQC)
Dataset contains 148K Python and 120K SQL domain question-code pairs, which were mined from Stack Overflow.
IIRC
Dataset contains more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents.
Spoken Language Understanding Resource Package (SLURP)
Dataset is a collection of ~72K audio recordings of single turn user interactions with a home assistant, annotated with three levels of semantics: Scenario, Action and Entities, including over 18 different scenarios, with 46 defined actions and 55 different entity types.
SubjQA
Dataset is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review and a span is highlighted as the answer to the question (with some questions having no answer).
MAVEN
Dataset contains 4,480 Wikipedia documents, 118,732 event mention instances, and 168 event types.
SigmaLaw-ABSA
Dataset contains legal data consisting of 39,155 legal cases, including 22,776 taken from the United States Supreme Court. For the data collection process, about 2,000 sentences were gathered for annotation, and court cases were selected without targeting any specific category. Party-based sentiment polarity values are annotated: negative, positive, and neutral.
PerSenT
Dataset that captures the sentiment of an author towards the main entity in a news article. This dataset contains annotation for 5.3K documents and 38K paragraphs covering 3.2K unique entities.
MK-SQuIT
Dataset contains 110,000 English question and SPARQL query pairs across four WikiData domains.
PheMT
Dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant.
EHR-Rel
A benchmark dataset for biomedical concept relatedness, consisting of 3,630 concept pairs sampled from electronic health records (EHRs).
2WikiMultihopQA
A multihop QA dataset, which uses structured and unstructured data. It includes the evidence information containing a reasoning path for multi-hop questions.
CoNLL 2003 ++
Similar to the original CoNLL 2003 dataset, except that the test set has been corrected for label mistakes. The dataset is split into training, development, and test sets, with 14,041, 3,250, and 3,453 instances respectively.
Open-Retrieval Conversational Question Answering (ORConvQA)
Dataset enhances QuAC by adapting it to an open retrieval setting. It is an aggregation of 3 existing datasets: (1) the QuAC dataset that offers information-seeking conversations, (2) the CANARD dataset that consists of context-independent rewrites of QuAC questions, and (3) the Wikipedia corpus that serves as the knowledge source of answering questions.
KB-Ref
Dataset is a referring expression comprehension dataset containing 43K expressions on 16K images. Unlike other referring expression datasets, it requires that each referring expression use at least one piece of external knowledge (information that cannot be obtained from the image).
ACL Citation Coreference Corpus
Dataset was constructed from papers from proceedings of the ACL conference in 2007 and 2008. Text was annotated for the coreference resolution task.
COMETA
Dataset is an entity linking dataset of layman medical terminology. It consists of 20K English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph.
WNUT 2016
Dataset is annotated with 10 fine-grained NER categories: person, geo-location, company, facility, product, music artist, movie, sports team, TV show, and other. Dataset was extracted from tweets and is structured in CoNLL format.
GrailQA
Dataset contains 64,331 crowdsourced questions involving up to 4 relations and functions like counting, comparatives, and superlatives. The dataset covers all the 86 domains in Freebase Commons.
ENT-DESC
Dataset was extracted from Wikipedia and Wikidata, which contains over 110k instances. Each sample is a triplet, containing a set of entities, the explored knowledge from a KG, and the description.
Social Bias Inference Corpus (SBIC) 
Dataset contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
Social Narrative Tree
Dataset contains 1,250 stories documenting a variety of daily social interactions.
Multi-Xscience
A multi-document summarization dataset created from scientific articles. Multi-Xscience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
Corpus for Knowledge-Enhanced Language Model Pre-training (KELM)
Dataset consists of ~18M sentences spanning ~45M triples with ~1,500 distinct relations from English Wikidata.
TriageSQL
Dataset is a cross-domain text-to-SQL question intention classification benchmark. It contains 34K databases and 390K questions from 20 existing datasets.
TweetEval
TweetEval consists of seven Twitter tasks, all framed as multi-class tweet classification: emotion recognition, emoji prediction, irony detection, hate speech detection, offensive language identification, sentiment analysis, and stance detection.
Tree-Based Dialog State Tracking (TreeDST)
Dataset is a multi-turn, multi-domain task-oriented dialog dataset annotated with tree-based user dialog states and system dialog acts. The goal is to provide a novel solution for end-to-end dialog state tracking as a conversational semantic parsing task. In total, it contains 27,280 conversations covering 10 domains with shared types of person, time and location.
Acronym Detection Dataset
Dataset contains 62,441 samples where each sample involves a sentence, an ambiguous acronym, and its correct meaning. Samples came from scientific papers from arXiv.
Acronym Identification
Task is to find the acronyms and the phrases that have been abbreviated by the acronyms in the document.
CC100-English
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The size of this corpus is 82 GB.
HeadQA
Dataset is a multichoice testbed of graduate-level questions about medicine, nursing, biology, chemistry, psychology, and pharmacology.
Open Table-and-Text Question Answering (OTT-QA)
Dataset contains open questions which require retrieving tables and text from the web to answer. The dataset is built on the HybridQA dataset.
Taskmaster-3
Dataset consists of 23,757 movie ticketing dialogs. "Movie ticketing" is defined as conversations where the customer's goal is to purchase tickets after deciding on theater, time, movie name, number of tickets, and date, or opt out of the transaction.
STAR
A schema-guided task oriented dialog dataset consisting of 127,833 utterances and knowledge base queries across 5,820 task-oriented dialogs in 13 domains that is especially designed to facilitate task and domain transfer learning in task-oriented dialog.
Scruples
Scruples contains 2 datasets: Anecdotes and Dilemmas. Anecdotes contains 32,000 real-life anecdotes about complex ethical situations, with 625,000 ethical judgments extracted from Reddit. Dilemmas contains 10,000 ethical dilemmas in the form of paired actions, where the model must identify which one was considered less ethical by crowd workers on Mechanical Turk.
Elsevier OA CC-BY
Dataset contains 40,091 open access (OA) CC-BY articles from across Elsevier's journals.
Microsoft News Dataset (MIND)
Dataset contains ~160k English news articles and more than 15 million impression logs generated by 1 million users.
Situated and Interactive Multimodal Conversations (SIMMC)
There are 2 datasets totalling ~13K human-human dialogs (~169K utterances) using a multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture (grounded in a shared virtual environment) and (b) fashion (grounded in an evolving set of images).
HINT3
In total there are three datasets: SOFMattress, Curekart and Powerplay11, each containing a diverse set of intents in a single domain: mattress products retail, fitness supplements retail, and online gaming respectively. Each dataset spans multiple coarse- and fine-grained intents, with the test sets drawn entirely from actual user queries on live systems at scale rather than being crowdsourced.
NewSHead
Dataset contains 369,940 English stories with 932,571 unique URLs: 359,940 stories for training, 5,000 for validation, and 5,000 for testing. Each news story contains at least three (and up to five) articles.
NatCat
Dataset contains naturally annotated category-text pairs for training text classifiers derived from 3 sources: Wikipedia, Reddit, and Stack Exchange.
DialoGLUE
Benchmark for task-oriented dialogue containing 7 datasets: Banking77 (online banking queries), HWU64 (popular personal assistant queries), CLINC150 (popular personal assistant queries), Restaurant8k (restaurant booking domain queries), DSTC8 SGD (multi-domain, task-oriented conversations between a human and a virtual assistant), TOP (compositional queries for hierarchical semantic representations), and MultiWOZ 2.1 (12K multi-domain dialogues with multiple turns).
Offensive Language Identification Dataset (OLID)
Dataset contains a collection of 14,200 annotated English tweets using an annotation model that encompasses three levels: offensive language detection, categorization of offensive language, and offensive language target identification.
Business Scene Dialogue (BSD)
Dataset contains 955 scenarios, 30,000 parallel sentences in English-Japanese.
English Possible Idiomatic Expressions (EPIE)
Dataset containing 25,206 sentences labelled with lexical instances of 717 idiomatic expressions.
SFU Opinion and Comments Corpus (SOCC)
Dataset contains 10,339 opinion articles (editorials, columns, and op-eds) together with their 663,173 comments from 303,665 comment threads, from the main Canadian English-language daily, The Globe and Mail, from January 2012 to December 2016. In addition, there is an annotated subset measuring toxicity, negation and its scope, and appraisal, containing 1,043 annotated comments in response to 10 different articles covering a variety of subjects: technology, immigration, terrorism, politics, budget, social issues, religion, property, and refugees.
Inquisitive
Dataset contains ~19K questions that are elicited while a person is reading through a document. Compared to existing datasets, INQUISITIVE questions target higher-level (semantic and discourse) comprehension of text.
Constructive Comments Corpus (C3)
Dataset is a subset of comments from the SFU Opinion and Comments Corpus. This subset, the Constructive Comments Corpus (C3) consists of 12,000 comments annotated by crowdworkers.
Numeric Fused-Heads
Dataset contains annotated sentences of numeric fused heads, along with their "missing head". A numeric fused head is a number whose head is implicit rather than explicitly provided. For example, in the sentence "I miss being 10", the number 10 refers to the age of 10, but "age" is not explicitly said.
LEDGAR
LEDGAR is a multilabel corpus of legal provisions in contracts suited for text classification in the legal domain (legaltech). It features over 1.8M+ provisions and a set of 180K+ labels. A smaller, cleaned version of the corpus is also available.
Olpbench
Dataset contains 30M open triples, 1M distinct open relations and 2.5M distinct mentions of approximately 800K entities. Dataset is used for open link prediction task.
FewRel 1.0
Dataset is a few-shot relation extraction dataset, which contains more than one hundred relations and tens of thousands of annotated instances across different domains.
SemEval2010 Task 8
Dataset consists of 8,000 sentences annotated for nine relations: Cause-Effect, Instrument-Agency, Product-Producer, Content-Container, Entity-Origin, Entity-Destination, Component-Whole, Member-Collection, and Message-Topic.
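As a sketch of the annotation scheme, entity mentions are marked inline and the relation label records the direction between e1 and e2; this example sentence is constructed here for illustration, not taken from the data:

    # Constructed SemEval-2010 Task 8 style example (illustrative only).
    sentence = "The <e1>fire</e1> was caused by a faulty <e2>heater</e2>."
    relation = "Cause-Effect(e2,e1)"  # the heater (e2) causes the fire (e1)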
KnowledgeNet
KnowledgeNet is a benchmark dataset for the task of automatically populating a knowledge base (Wikidata) with facts expressed in natural language text on the web.
Evidence Inference
Dataset contains 10,137 annotated prompts for 2,419 unique articles, with the task of inferring whether a given clinical treatment is effective with respect to a specified outcome. The dataset provides a prompt that specifies an intervention, a comparator, and an outcome, along with a full-text article. The model is then used to infer the reported findings with respect to this prompt.
DailyDialog++
DailyDialog++ is an open-domain dialogue evaluation dataset consisting of 19k contexts with five relevant responses for each context. Additionally for 11k contexts, it includes five adversarial irrelevant responses which are specifically crafted to have lexical or semantic overlap with the context but are still unacceptable as valid responses.
LogiQA
Dataset consists of 8,678 QA instances, covering multiple types of deductive reasoning. Multiple-choice.
SCDE
Dataset of sentence-level cloze questions sourced from public school examinations. Each instance consists of a passage with multiple sentence-level blanks and a shared set of candidates. Besides the right answer to each cloze in the passage, the candidate set also contains ones which don't answer any cloze, called distractors. [requires contacting authors for data]
CoDEx 
Three graph datasets containing positive and hard negative triples, entity types, entity and relation descriptions, and Wikipedia page extracts for entities.
QED
Given a question and a passage, QED represents an explanation of the answer as a combination of discrete, human-interpretable steps: sentence selection, referential equality, and predicate entailment. Dataset was built as a subset of the Natural Questions dataset.
SMCalFlow
Dataset contains natural conversations about tasks involving calendars, weather, places, and people. Each turn is annotated with an executable dataflow program featuring API calls, function composition, and complex constraints built from strings, numbers, dates and times.
Critical Role Dungeons and Dragons Dataset (CRD3)
Dataset is collected from 159 Critical Role episodes transcribed to text dialogues, consisting of 398,682 turns. It also includes corresponding abstractive summaries collected from the Fandom wiki. Critical Role is an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons.
The Semantic Scholar Open Research Corpus (S2ORC)
Dataset contains 136M+ paper nodes with 12.7M+ full text papers and connected by 467M+ citation edges.
Wiki-CS
Dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field.
Semantic Parsing with Language Assistance from Humans (SPLASH)
Dataset enables text-to-SQL systems to seek and leverage human feedback to further improve the overall performance and user experience. Dataset contains 9,314 question-feedback pairs, 8,352 of which correspond to questions in the Spider training split and 962 to questions in the Spider development split.
ClariQ
Dataset consists of single-turn conversations (initial_request, followed by clarifying question and answer). In addition, it comes with synthetic multi-turn conversations (up to three turns). ClariQ features approximately 18K single-turn conversations, as well as 1.8 million multi-turn conversations.
DialogRE
Dataset contains human-annotated dialogue-based relation extraction data comprising 1,788 dialogues originating from the complete transcripts of the American television sitcom "Friends". There are 36 possible relation types that can exist between an argument pair in a dialogue.
Visual Genome
Dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects.
CraigsListBargain
Dataset contains 6,682 human-human dialogues where 2 agents negotiate the sale/purchase of an item.
A Multi-Turn, Multi-Domain Dialogue Dataset (KVRET)
Dataset contains 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation.
CMU_ARCTIC
Dataset contains 1,150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
L2-ARCTIC
Dataset includes recordings from twenty-four (24) non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, Arabic and Vietnamese, each L1 containing recordings from two male and two female speakers. Each speaker recorded approximately one hour of read speech from CMU's ARCTIC prompts.
ACL Anthology Reference Corpus (ACL ARC)
Dataset contains 10,921 articles from the February 2007 snapshot of the Anthology; text and metadata for the articles were extracted, consisting of BibTeX records derived either from the headers of each paper or from metadata taken from the Anthology website.
NCLS-Corpora
Contains two datasets for cross-lingual summarization: ZH2ENSUM and EN2ZHSUM. There are 370,759 English-to-Chinese cross-lingual summarization (CLS) pairs in EN2ZHSUM and 1,699,713 Chinese-to-English CLS pairs in ZH2ENSUM.
TalkDown
Dataset used for classifying condescending acts in context. Dataset was extracted from Reddit COMMENT and REPLY pairs in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending.
Implicature and Presupposition Diagnostic dataset (IMPPRES)
Dataset contains semiautomatically generated sentence pairs illustrating well-studied pragmatic inference types. IMPPRES follows the format of SNLI, MultiNLI and XNLI, which was created to evaluate how well trained NLI models recognize several classes of presuppositions and scalar implicatures.
Abstractive Sentence Simplification Evaluation and Tuning (ASSET)
Dataset consists of 23,590 human simplifications associated with the 2,359 original sentences from TurkCorpus (10 simplifications per original sentence).
Visual Commonsense Graphs
Dataset consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 59,000 images, each paired with short video summaries of before and after.
Story Commonsense
Dataset contains a total of 300k low-level annotations for motivation and emotion across 15,000 stories (randomly selected from the ROC story training set). It covers over 150,000 character-line pairs, of which 56k have an annotated motivation and 105k have an annotated change in emotion (i.e. a label other than none).
StereoSet
Dataset that measures stereotype bias in language models. StereoSet consists of 17,000 sentences that measure model preferences across gender, race, religion, and profession.
Hippocorpus
Dataset of 6,854 English diary-like short stories about recalled and imagined events.
Web Demonstration and Explanation Dataset (Web-D-E)
Dataset consists of 520 explanations and corresponding demonstrations of web-based tasks from the Mini World of Bits.
GeNeVA
Data contains the CoDraw and i-CLEVR datasets used for the Generative Neural Visual Artist (GeNeVA) task.
FigureQA
Dataset is a visual reasoning corpus of over one million question answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts.
BioCreative II Gene Mention Recognition (BC2GM)
Dataset contains data where participants are asked to identify a gene mention in a sentence by giving its start and end characters. The training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations). [registration required for access]
BC5CDR Drug/Chemical (BC5-Chem)
Dataset consists of three separate sets of articles with chemicals and their relations annotated. [registration required for access]
BC5CDR Disease (BC5-Disease)
Dataset consists of three separate sets of articles with diseases and their relations annotated. [registration required for access]
JNLPBA
The BioNLP / JNLPBA Shared Task 2004 involves the identification and classification of technical terms referring to concepts of interest to biologists in the domain of molecular biology. 
NCBI Disease Corpus
Dataset contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier.
EBM PICO
Dataset contains ~5,000 medical abstracts describing clinical trials, annotated in detail with respect to characteristics of the underlying trial Populations (e.g., diabetics), Interventions (insulin), Comparators (placebo) and Outcomes (blood glucose levels).
ChemProt
ChemProt [is] a disease chemical biology database, which is based on a compilation of multiple chemicalā€“protein annotation resources, as well as disease-associated proteinā€“protein interactions (PPIs). [registration required for access]
Drug-Drug Interaction (DDI)
Dataset contains 792 texts selected from the DrugBank database and 233 additional Medline abstracts. This fine-grained corpus has been annotated with a total of 18,502 pharmacological substances and 5,028 DDIs, including both pharmacokinetic (PK) and pharmacodynamic (PD) interactions.
Gene-Disease Associations (GAD)
Dataset is an archive of published genetic association studies that provides a comprehensive, public, web-based repository of molecular, clinical and study parameters for >5,000 human genetic association studies at this time.
BIOSSES
Dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. TAC dataset consists of 20 articles (reference articles) and citing articles that vary from 12 to 20 for each of the reference articles.
HoC (Hallmarks of Cancer)
Dataset consists of 1,852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy.
PubmedQA
A biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe.
TRACT: Tweets Reporting Abuse Classification Task Corpus
Dataset used for a multi-class classification task involving three classes of tweets that mention abuse reports: "report" (annotated as 1); "empathy" (annotated as 2); and "general" (annotated as 3).
Hateful Memes
Dataset is used to detect hateful memes. In total, the dataset contains 10,000 memes of five different types: multimodal hate, where benign confounders were found for both modalities; unimodal hate, where one or both modalities were already hateful on their own; benign image confounders; benign text confounders; and random not-hateful examples.
Adverse Drug Effect (ADE) Corpus
There are 3 datasets: DRUG-AE.rel provides relations between drugs and adverse effects, DRUG-DOSE.rel provides relations between drugs and dosages, and ADE-NEG.txt provides all sentences in the ADE corpus that DO NOT contain any drug-related adverse effects.
MEDIQA-Answer Summarization
Dataset containing question-driven summaries of answers to consumer health questions.
NEJM-enzh
Dataset is an English-Chinese parallel corpus, consisting of about 100,000 sentence pairs and 3,000,000 tokens on each side, from the New England Journal of Medicine (NEJM).
Wikipedia Current Events Portal (WCEP) Dataset
Dataset is used for multi-document summarization (MDS) and consists of short, human-written summaries about news events, obtained from the Wikipedia Current Events Portal (WCEP), each paired with a cluster of news articles associated with an event.
Worldtree Corpus
Dataset contains multi-hop question answering/explanations where questions require combining between 1 and 16 facts (average 6) to generate detailed explanations for question answering inference. Each explanation is represented as a lexically-connected "explanation graph" that combines an average of 6 facts drawn from a semi-structured knowledge base of 9,216 facts across 66 tables.
ScienceExamCER
Dataset contains 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms.
LibriMix
Dataset is used for speech source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it.
TVQA
Dataset is used for video question answering and consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video.
MovieQA
Dataset used to evaluate automatic story comprehension from both video and text. The dataset consists of almost 15,000 multiple-choice questions and answers obtained from over 400 movies.
TGIF-QA
Dataset consists of 165K QA pairs from 72K animated GIFs. Used for video question answering.
Tumblr GIF (TGIF)
Dataset contains 100K animated GIFs and 120K sentences describing visual content of the animated GIFs.
ArxivPapers
Dataset is a corpus of over 100,000 scientific papers related to machine learning.
SegmentedTables & LinkedResults
Dataset annotates dataset mentions in captions, the type of each table (leaderboard, ablation, irrelevant), and ground-truth cell annotations into the classes: dataset, metric, paper model, cited model, meta, and task.
CodeSearchNet Corpus
Dataset contains functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub.
CompGuessWhat?!
Dataset contains 65,700 dialogues based on GuessWhat?! dataset dialogues and enhanced by including object attributes coming from resources such as VISA attributes, VisualGenome and ImSitu.
Gigaword
Dataset is used for headline generation on a corpus of article-headline pairs from Gigaword, consisting of around 4 million articles.
Opinosis
Dataset contains sentences extracted from reviews for 51 topics. Topics and opinions are obtained from Tripadvisor, Edmunds.com and Amazon.com.
BillSum
Dataset is used for summarization of US Congressional and California state bills.
SAMSum
Dataset contains over 16K chat dialogues with manually annotated summaries.
Annotated Enron Subject Line Corpus (AESLC)
Dataset contains email messages of employees in the Enron Corporation.
Multi-News
Dataset consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.
News Category Dataset
Dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost.
NIPS Papers
Dataset contains the title, authors, abstracts, and extracted text for all NIPS papers from 1987 to 2016.
CSTR VCTK Corpus
Dataset contains speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent.
Open Resource for Click Analysis in Search (ORCAS)
ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
ClarQ
Dataset consists of ~2M question/post tuples distributed across 173 domains of Stack Exchange.
ManyModalQA
Dataset contains 10,190 questions, 2,873 images, 3,789 text snippets, and 3,528 tables scraped from Wikipedia.
POLUSA
Dataset contains 0.9M articles covering policy topics published between Jan. 2017 and Aug. 2019 by 18 news outlets representing the political spectrum.
DocBank
Dataset contains fine-grained token-level annotations for document layout analysis. It includes 5,053 documents; the validation set and the test set each include 100 documents.
Get it #OffMyChest
Dataset is used for affective understanding of conversations, focusing on how speakers use emotions to react to a situation and to each other. Posts were taken from the top 2018 Reddit posts on /r/CasualConversations and /r/OffMyChest.
AirDialogue
Dataset contains 402,038 goal-oriented conversations.
WSJ0 Hipster Ambient Mixtures (WHAM!)
Dataset consists of two speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area.
Crema-D
Dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were collected.
LibriTTS
Dataset is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate.
LJSpeech
Dataset consists of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Surrey Audio-Visual Expressed Emotion (SAVEE)
Dataset consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically-balanced for each emotion.
Civil Comments
Dataset contains the archive of the Civil Comments platform. Dataset was annotated for toxicity.
Common Sense Explanations (CoS-E)
Dataset used to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework.
e-SNLI
Dataset contains human-annotated natural language explanations of the entailment relations.
1 Billion Word Language Model Benchmark (lm1b)
Dataset used for measuring progress in statistical language modeling.
Math Dataset
Dataset contains mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
SciCite
Dataset used for classifying citation intents in academic papers. The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with a context key.
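A minimal sketch of reading such records, assuming one JSON object per line with the label and context keys named above; the file name is hypothetical:

    import json

    # Hypothetical file name; assumes one JSON object per line with
    # at least the "label" and "context" keys described above.
    with open("scicite_train.jsonl") as f:
        for line in f:
            record = json.loads(line)
            print(record["label"], "->", record["context"][:80])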
WordNet
Dataset is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
Yelp Polarity Reviews
Dataset contains 1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity. Dataset from FastAI's website.
Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong
Dataset contains aligned sentence pairs from bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislations and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others.
MMD
Dataset contains over 150K conversation sessions between shoppers and sales agents.
ParaBank
Dataset contains paraphrases for 79.5 million references, with on average 4 paraphrases per reference.
Humicroedit
Dataset contains 15,095 edited news headlines and their numerically assessed humor.
VQA-Introspect
Dataset consists of 238K new perception questions from the VQA dataset, which serve as sub-questions corresponding to the set of perceptual tasks needed to answer complex reasoning questions.
Talk the Walk
Dataset consists of over 10k crowd-sourced dialogues in which two human annotators collaborate to navigate to target locations in the virtual streets of NYC.
FB15K-237 Knowledge Base Completion Dataset
Dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs.
WN18RR
Dataset contains knowledge base relation triples from WordNet.
AmbigNQ
Dataset covering 14,042 questions from NQ-open, an existing open-domain QA benchmark.
OpenKeyPhrase (OpenKP)
Open domain keyphrase extraction dataset containing 148,124 real world web documents along with a human annotation indicating the 1-3 most relevant keyphrases.
Flickr30K Entities
Dataset contains 244k coreference chains and 276k manually annotated bounding boxes for each of the 31,783 images and 158,915 English captions (five per image) in the original dataset.
Street View Text (SVT)
Dataset contains images with textual content used for scene text recognition.
WikiBio
Dataset contains 728,321 biographies from Wikipedia. For each article, it provides the first paragraph and the infobox (both tokenized).
Rotowire and SBNation Datasets
Dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box and line scores.
E2E
Dataset contains 50k combinations of a dialogue-act-based meaning representation, with 8.1 references on average, in the restaurant domain.
LogicNLG
Dataset is a table-based fact-checking dataset with rich logical inferences in the annotated statements.
FakeNewsNet
Repo contains two datasets with news content, social context, and spatiotemporal information from Politifact and Gossipcop.
LIAR Dataset
Dataset contains 12.8K manually labeled short statements in various contexts from POLITIFACT.COM, which provides detailed analysis report and links to source documents for each case.
Dialogue-Based Reading Comprehension Examination (DREAM)
Dataset contains 10,197 multiple choice questions for 6,444 dialogues, collected from English-as-a-foreign-language examinations designed by human experts. DREAM is likely to present significant challenges for existing reading comprehension systems: 84% of answers are non-extractive, 85% of questions require reasoning beyond a single sentence, and 34% of questions also involve commonsense knowledge.
The New York Times Annotated Corpus
Dataset contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom.
BigPatent
Dataset consists of 1.3 million records of U.S. patent documents along with human written abstractive summaries.
Libri-Light
Dataset contains 60K hours of unlabelled speech from audiobooks in English and a small labelled data set (10h, 1h, and 10 min).
Atlas of Machine Commonsense (ATOMIC)
Dataset is a knowledge graph of 877K textual description triples of inferential knowledge.
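For illustration, such a triple links an event to an inference through a typed relation (xIntent is one of ATOMIC's relation types); the example below is constructed in that style, not quoted from the data:

    # Constructed ATOMIC-style if-then triple: (event, relation, inference).
    event, relation, inference = (
        "PersonX pays PersonY a compliment",
        "xIntent",  # relation type expressing PersonX's intent
        "PersonX wanted to be nice",
    )
    print(f"IF {event} THEN ({relation}) {inference}")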
Genia
Dataset contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms "human", "blood cells", and "transcription factors". The corpus has been annotated for part-of-speech, constituency syntax, terms, events, relations, and coreference.
DNA Methylation Corpus
Dataset contains 200 abstracts including a representative sample of all PubMed citations relevant to DNA methylation, with manual annotations for nearly 3,000 gene/protein mentions and 1,500 DNA methylation and demethylation events.
Exhaustive PTM Corpus
Dataset contains 360 abstracts manually annotated in the BioNLP Shared Task event representation for over 4,500 mentions of proteins and 1,000 statements of modification events of nearly 40 different types.
mTOR Pathway Corpus
Dataset contains 1,300 annotated event instances of protein associations and dissociation reactions.
PTM Event Corpus
Dataset contains 157 PubMed abstracts annotated for over 1,000 proteins and 400 post-translational modification events identifying the modified proteins and sites.
T4SS Event Corpus
Dataset contains 27 full text publications totaling 15,143 pseudo-sentences (text sentences plus table rows, references, etc.) and 244,942 tokens covering 4 classes: Bacteria, Cellular components, Biological Processes, and Molecular functions.
Abstract Meaning Representation (AMR) Bank
Dataset contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text.
Adversarial NLI (ANLI)
Dataset is an NLI benchmark created via human-and-model-in-the-loop enabled training (HAMLET). Humans were tasked with providing hypotheses that fool the model into misclassifying the label.
SCITLDR
Dataset is a combination of TLDRs written by human experts and author-written TLDRs of computer science papers from OpenReview.
Self-Annotated Reddit Corpus (SARC)
Dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. It contains statements, along with their responses as well as many non-sarcastic comments from the same source.
Fact Extraction and Verification (FEVER)
Dataset contains 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted, or NotEnoughInfo.
DuoRC
Dataset contains 186,089 unique question-answer pairs created from a collection of 7,680 pairs of movie plots where each pair in the collection reflects two versions of the same movie.
ColBERT
Dataset contains 200k short texts (100k positive, 100k negative). Used for humor detection.
PARANMT-50M
Dataset containing more than 50 million English-English sentential paraphrase pairs.
Igbo Text
Dataset is a parallel dataset for the Igbo language.
Urhobo Text
Dataset is a parallel dataset for the Urhobo language containing 10.3M tokens.
Logic2Text
Dataset contains 5,600 tables and 10,753 descriptions involving common logic types paired with the underlying logical forms.
OneStopQA
Dataset comprises 30 articles from the Guardian in 3 parallel text difficulty versions and contains 1,458 paragraph-question pairs with multiple choice questions, along with manual span markings for both correct and incorrect answers.
Audio Visual Scene-Aware Dialog (AVSD)
Dataset consists of text-based human conversations about short videos from the Charades dataset.
Will-They-Won't-They (WT-WT)
Dataset of English tweets targeted at stance detection for the rumor verification task.
SciREX
Dataset is fully annotated with entities, their mentions, their coreferences, and their document level relations.
GoEmotions
Dataset contains 58K carefully curated Reddit comments labeled for 27 emotion categories: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, & surprise.
Post-Modifier Dataset (PoMo)
Dataset for developing post-modifier generation systems. It's a collection of sentences that contain entity post-modifiers, along with a collection of facts about the entities obtained from Wikidata.
DoQA
Dataset is a conversational QA dataset over domain-specific FAQs, containing 2,437 information-seeking question/answer dialogues (10,917 questions in total) on three different domains: cooking, travel, and movies.
Personal Events in Dialogue Corpus
Dataset is a corpus containing annotated dialogue transcripts from fourteen episodes of the podcast This American Life. It contains 1,038 utterances, made up of 16,962 tokens, of which 3,664 represent events.
Quda
Dataset contains 14,035 diverse user queries annotated with 10 low-level analytic tasks that assist in the deployment of state-of-the-art machine/deep learning techniques for parsing complex human language.
DramaQA
Dataset contains 16,191 question answer pairs from 23,928 various length video clips, with each question answer pair belonging to one of four difficulty levels.
Statutory Reasoning Assessment (SARA)
Dataset contains a set of rules extracted from the statutes of the US Internal Revenue Code (IRC), together with a set of natural language questions which may only be answered correctly by referring to the rules.
InfoTabs
Dataset contains human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes.
emrQA
Dataset contains 1M question-logical form and 400,000+ question answer evidence pairs on electronic medical records. In total, there are 2,495 clinical notes.
Credbank
Dataset comprises more than 60M tweets grouped into 1,049 real-world events, each annotated by 30 human annotators.
BuzzFace
Dataset focused on news stories (which are annotated for veracity) posted to Facebook during September 2016, consisting of nearly 1.7 million Facebook comments discussing the news content, Facebook plugin comments, Disqus plugin comments, and the associated webpage content of the news articles.
Some Like it Hoax
Dataset contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific).
Focused Open Biology Information Extraction (FOBIE)
Dataset contains 1,500 manually-annotated sentences that express domain-independent relations between central concepts in a scientific biology text, such as trade-offs and correlations.
WebNLG (Enriched)
Dataset consists of 25,298 (data,text) pairs and 9,674 distinct data units. The data units are sets of RDF triples extracted from DBPedia and the texts are sequences of one or more sentences verbalising these data units.
FreebaseQA
Dataset contains 28,348 unique questions for open domain QA over the Freebase knowledge graph.
Deft
Dataset contains annotated content from two different data sources: 1) 2,443 sentences from various 2017 SEC contract filings from the publicly available US Securities and Exchange Commission EDGAR (SEC) database, and 2) 21,303 sentences from open source textbooks including topics in biology, history, physics, psychology, economics, sociology, and government.
Clash of Clans
Dataset contains 50K user comments, both from the iTunes App Store and Google Play. The dataset spans from Oct 18, 2018 to Feb 1, 2019.
IRC Disentanglement
Dataset contains 77,563 messages of internet relay chat (IRC). Almost all are from the Ubuntu IRC Logs.
Action Learning From Realistic Environments and Directives (ALFRED)
Dataset contains 8k+ expert demonstrations with 3 or more language annotations each, comprising 25,000 language directives. A trajectory consists of a sequence of expert actions, the corresponding image observations, and language annotations describing segments of the trajectory.
Visual Storytelling Dataset (VIST)
Dataset contains 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. VIST is previously known as "SIND", the Sequential Image Narrative Dataset (SIND).
All the News 2.0
Dataset contains 2.7 million articles from 26 different publications from January 2016 to April 1, 2020.
Datasets Knowledge Embedding
Several datasets containing edges and nodes for knowledge base building.
Frames
Dataset contains 1,369 human-human dialogues with an average of 15 turns per dialogue. This corpus contains goal-oriented dialogues between users who are given some constraints to book a trip and assistants who search a database to find appropriate trips.
SemEval-2014 Task 3
Dataset is used for cross-level semantic similarity which measures the degree to which the meaning of a larger linguistic item, such as a paragraph, is captured by a smaller item, such as a sentence.
SemEval-2019 Task 6 
Dataset contains tweets labeled as either offensive or not offensive (Sub-task A), with offensive tweets further classified into categories (Sub-tasks B and C).
WNUT 2017
Dataset contains tweets, Reddit comments, YouTube comments, and StackExchange posts annotated with 6 entity types: person, location, corporation, consumer good, creative work, and group.
MPQA Opinion Corpus
Dataset contains news articles and other text documents manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
Ohsumed Dataset
Dataset containing references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991).
SelQA
Dataset provides crowdsourced annotation for two selection-based question answering tasks, answer sentence selection and answer triggering. The dataset comprises about 8K factoid questions for the top-10 most prevalent topics among Wikipedia articles.
PubMed 200k RCT Dataset
Dataset is based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences.
Textual Visual Semantic Dataset
A dataset for detecting and recognizing text appearing in images (e.g. signboards, traffic signals, or brands on clothing or objects). Around 82,000 images.
EventQA
A dataset for answering Event-Centric questions over Knowledge Graphs (KGs). It contains 1,000 semantic queries and the corresponding verbalisations.
Sequential Question Answering (SQA)
Dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has 6,066 sequences with 17,553 questions in total.
WikiTablesQuestions
Dataset is for the task of question answering on semi-structured HTML tables.
DocRED
Dataset was constructed from Wikipedia and Wikidata. It annotates both named entities and relations.
Complex Sequential Question Answering (CSQA)
Dataset contains around 200K dialogs with a total of 1.6M turns. Further, unlike existing large scale QA datasets which contain simple questions that can be answered from a single tuple, the questions in the dialogs require a larger subgraph of the KG.
Linked WikiText-2
Dataset contains over 2 million tokens from Wikipedia articles, along with annotations linking mentions to their corresponding entities and relations in Wikidata.
OpenDialKG
Dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic. Each dialog turn is paired with its corresponding "KG paths" that weave together the KG entities and relations that are mentioned in the dialog.
BuGL
Dataset consists of 54 GitHub projects in four different programming languages, namely C, C++, Java and Python, with around 10,187 issues.
HybridQA
Dataset contains over 70K question-answer pairs based on 13,000 tables; each table is on average linked to 44 passages.
PoKi
Dataset is a corpus of 61,330 poems written by children from grades 1 to 12.
MuTual
Retrieval-based dataset for multi-turn dialogue reasoning, which is modified from Chinese high school English listening comprehension test data.
ToTTo
Dataset is used for the controlled generation of descriptions of tabular data, comprising over 100,000 examples. Each example is an aligned pair of a highlighted table and the description of the highlighted content.
VIdeO-and-Language INference (VIOLIN)
Dataset contains 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video (YouTube and TV shows). Inference descriptions of video content were annotated. Inferences are used to measure entailment vs video clip.
NELA-GT-2019
Dataset contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Included are source-level ground truth labels from 7 different assessment sites.
ReClor
Dataset contains logical reasoning questions of standardized graduate admission examinations.
Compositional Freebase Questions (CFQ)
Dataset contains questions and answers that also provides for each question a corresponding SPARQL query against the Freebase knowledge base.
MoviE Text Audio QA (MetaQA)
Dataset contains more than 400K questions for both single and multi-hop reasoning, and provides more realistic text and audio versions. MetaQA serves as a comprehensive extension of WikiMovies.
WebQuestions
Dataset contains 6,642 question/answer pairs. The questions are supposed to be answerable by Freebase, a large knowledge graph. The questions are mostly centered around a single named entity.
MathQA
Dataset contains English multiple-choice math word problems covering multiple math domain categories by modeling operation programs corresponding to word problems in the AQuA dataset.
SherLIiC
Dataset contains manually annotated inference rule candidates (InfCands), accompanied by ~960k unlabeled InfCands, and ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09.
DiaBLa
Parallel dataset of spontaneous, written, bilingual dialogues for the evaluation of Machine Translation, annotated for human judgments of translation quality.
Multimodal Sarcasm Detection Dataset (MUStARD)
The dataset, a multimodal video corpus, consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context, which provides additional information on the scenario where the utterance occurs.
Multimodal EmotionLines Dataset (MELD)
Dataset contains the same dialogue instances available in EmotionLines dataset, but it also encompasses audio and visual modality along with text. It has more than 1,400 dialogues and 13,000 utterances from Friends TV series. Each utterance in a dialogue has been labeled by any of these seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise and Fear. It also has sentiment (positive, negative and neutral) annotation for each utterance.
Book Depository Dataset
Dataset contains books from bookdepository.com: not the actual content of the books, but metadata such as title, description, dimensions, and category.
COVID-19 Open Research Dataset (CORD-19)
Dataset contains 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.
SemEval-2016 Task 4
Dataset contains 5 subtasks involving the sentiment analysis of tweets.
Multi30k
Dataset of images paired with sentences in English and German. This dataset extends the Flickr30K dataset.
CommonGen
Dataset consists of 30k concept-sets with human-written sentences as references.
Neutralizing Biased Text
A parallel corpus of 180,000+ sentence pairs where one sentence is biased and the other is neutralized. The data were obtained from bias-neutralizing Wikipedia edits.
Wikipedia News Corpus
Text from Wikipedia's current events page with dates.
ParCorFull
A parallel corpus annotated for the task of translation of coreference across languages.
Taskmaster-2
Dataset consists of 17,289 dialogs in seven domains: restaurants (3276), food ordering (1050), movies (3047), hotels (2355), flights (2481), music (1602), and sports (3478). It consists entirely of spoken two-person dialogs.
WAT 2019 Hindi-English
Dataset is for multimodal English-to-Hindi translation: the input is an image, a rectangular region in the image, and an English caption; the output is a caption in Hindi.
The TAC Relation Extraction Dataset (TACRED)
A relation extraction dataset containing 106k+ examples covering 42 TAC KBP relation types. Costs $25 for non-members.
Webis-TLDR-17 Corpus
Dataset contains 3 million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization datasets from the social media domain.
Webis-Snippet-20 Corpus
Dataset comprises four abstractive snippet datasets from ClueWeb09, ClueWeb12, and DMOZ descriptions. More than 10 million <webpage, abstractive snippet> pairs and 3.5 million <query, webpage, abstractive snippet> pairs were collected.
WSD English All-Words Fine-Grained Datasets
A unified collection of five standard all-words Word Sense Disambiguation datasets.
Curation Corpus
Dataset is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves.
How2
Dataset of instructional videos covering a wide variety of topics across about 2,000 hours of video clips, with word-level time alignments to the ground-truth English subtitles. About 300 hours have also been translated into Portuguese subtitles.
LibriVoxDeEn
Dataset contains sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences.
Translation-Augmented-LibriSpeech-Corpus (Libri-Trans)
Dataset is an augmentation of LibriSpeech ASR and contains English utterances (from audiobooks) automatically aligned with French text. It offers ~236h of speech aligned to translated text.
ArguAna TripAdvisor Corpus
Dataset contains 2,100 hotel reviews balanced with respect to the reviews' sentiment scores. Reviews are segmented into subsentence-level statements that have been manually classified as a fact, a positive opinion, or a negative opinion.
LC-QuAD 2.0
Dataset contains questions and SPARQL queries. LC-QuAD uses DBpedia v04.16 as the target KB.
X-Sum
The XSum dataset consists of 226,711 Wayback-archived BBC articles (2010 to 2017) covering a wide variety of domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts.
CAPES
A parallel corpus of theses and dissertation abstracts in Portuguese and English from CAPES.
Open Images V6
Dataset containing millions of images that have been annotated with image-level labels and object bounding boxes.
Explain Like I'm Five (ELI5)
The dataset contains 270K threads of open-ended questions that require multi-sentence answers. It was extracted from the subreddit "Explain Like I'm Five" (ELI5), in which an online community answers questions with responses that 5-year-olds can comprehend. Facebook scripts allow you to preprocess the data.
Background Knowledge Dialogue Dataset
Dataset containing movie chats wherein each response is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments and reviews about the movie.
Academic
Questions about the Microsoft Academic Search (MAS) database, derived by enumerating every logical query that could be expressed using the search page of the MAS website and writing sentences to match them.
Advising
Dataset contains questions regarding course information at the University of Michigan, but with fictional student records.
ATIS
Dataset is a collection of utterances to a flight booking system, accompanied by a relational database and SQL queries to answer the questions.
Break
Dataset contains 83,978 examples sampled from 10 question answering datasets over text, images and databases. Dataset used to obtain the Question Decomposition Meaning Representation (QDMR) for questions.
Coarse Discourse
Dataset contains discourse annotations and relations on Reddit threads from 2016. The comment text itself must be merged back in via the Reddit API; a hedged sketch of that step follows.
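The sketch below uses the third-party praw client; the credentials, file name, and JSON field names are placeholder assumptions, so check them against the released dump.

```python
# Sketch: re-attach comment bodies to the Coarse Discourse annotations.
# Credentials and field names are placeholders, not guaranteed by the dataset.
import json
import praw

reddit = praw.Reddit(client_id="YOUR_ID", client_secret="YOUR_SECRET",
                     user_agent="coarse-discourse-merge")

with open("coarse_discourse_dataset.json") as f:
    for line in f:
        thread = json.loads(line)
        submission = reddit.submission(url=thread["url"])
        submission.comments.replace_more(limit=None)
        # Index live comments by their Reddit fullname (e.g. "t1_abc").
        bodies = {c.name: c.body for c in submission.comments.list()}
        for post in thread["posts"]:
            post["body"] = bodies.get(post.get("id"))
```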
Complex Factoid Question Answering with Paraphrase Clusters (ComQA)
The dataset contains questions with various challenging phenomena such as the need for temporal reasoning, comparison (e.g., comparatives, superlatives, ordinals), compositionality (multiple, possibly nested, subquestions with multiple entities), and unanswerable questions.
GAP Coreference Dataset
Dataset contains 8,908 gender-balanced coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia.
GeoQuery
Dataset contains utterances issued to a database of US geographical facts.
PG-19
Dataset contains a set of books extracted from the Project Gutenberg library that were published before 1919. It also contains metadata of book titles and publication dates.
Restaurants
Dataset contains user questions about restaurants, their food types, and locations.
Scholar
User questions about academic publications, with automatically generated SQL that was checked by asking the user if the output was correct.
Trec CAR Dataset
Dataset contains topics, outlines, and paragraphs that are extracted from English Wikipedia (2016 XML dump). Wikipedia articles are split into the outline of sections and the contained paragraphs.
Wikipedia
The 2016-12-21 dump of English Wikipedia.
WikiSplit
Dataset contains 1 million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.
WikiSQL
A large collection of automatically generated questions about individual tables from Wikipedia.
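To make the format concrete, here is a hypothetical WikiSQL-style record (all values invented for illustration): each question is paired with a table and a machine-readable query rather than a raw SQL string.

```python
# Hypothetical WikiSQL-style record (values invented for illustration).
example = {
    "question": "Which player attended Duke?",
    "table": {
        "header": ["Player", "No.", "Position", "School/Club Team"],
        "rows": [["Shane Battier", "31", "Forward", "Duke"]],
    },
    # sel: selected column index; agg: aggregation op (0 = none);
    # conds: [column index, operator index, value] triples.
    "sql": {"sel": 0, "agg": 0, "conds": [[3, 0, "Duke"]]},
}
```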
AG News
Dataset contains more than 1 million news articles for topic classification. The 4 classes are: World, Sports, Business, and Sci/Tech.
Conference on Computational Natural Language Learning (CoNLL 2003)
Dataset contains news articles whose text is segmented into four columns: the first item is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag.
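A minimal sketch of reading that four-column layout; blank lines separate sentences, -DOCSTART- lines separate documents, and the file path is a placeholder.

```python
# Sketch: parse CoNLL-2003-style "word POS chunk NER" lines into sentences.
def read_conll(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ner = line.split()
            current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences

# Each token becomes a (word, POS, chunk, NER) tuple,
# e.g. ("U.N.", "NNP", "I-NP", "I-ORG").
```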
Excitement Datasets
Datasets contain negative feedback from customers in which they state reasons for dissatisfaction with a given company. The datasets are available in English and Italian.
Groningen Meaning Bank
Dataset contains texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and discourse representation structures compatible with first-order logic.
Kensho Derived Wikimedia Dataset (KDWD)
Dataset contains two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base.
Language Modeling Broadened to Account for Discourse Aspects (LAMBADA)
Dataset contains narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word.
Large Movie Review Dataset - Imdb
Dataset contains 25,000 highly polar movie reviews for training, and 25,000 for testing.
LitBank
Dataset contains 100 works of English-language fiction. It currently contains annotations for entities, events and entity coreference in a sample of ~2,000 words from each of those texts, totaling 210,532 tokens.
QASC
QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences.
Quoref
Dataset which tests the coreferential reasoning capability of reading comprehension systems. In this span-selection benchmark containing 24K questions over 4.7K paragraphs from Wikipedia, a system must resolve hard coreferences before selecting the appropriate span(s) in the paragraphs for answering questions.
SemEval-2019 Task 9 - Subtask A
Suggestion Mining from Online Reviews and Forums: Dataset contains corpora of unstructured text intended to be mined for suggestions.
SemEval-2019 Task 9 - Subtask B
Suggestion Mining from Hotel Reviews: Dataset contains corpora of unstructured text intended to be mined for suggestions.
Sentences Involving Compositional Knowledge (SICK)
Dataset contains sentence pairs, generated from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description.
Wikidata NE dataset
Dataset has 2 parts: the Named Entity files and the link files. The Named Entity files include the most important information about the entities, whereas the link files contain the links and ids in other databases.
WikiText-103 & 2
Dataset contains word- and character-level tokens extracted from Wikipedia.
A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (CLEVR & CoGenT)
Visual question answering dataset contains 100,000 images and 999,968 questions.
Abductive Natural Language Inference (aNLI)
Dataset is a binary-classification task: the goal is to pick the most plausible explanatory hypothesis given two observations from narrative contexts. It contains 20k commonsense narrative contexts and 200k explanations.
Common Objects in Context (COCO)
COCO is a large-scale object detection, segmentation, and captioning dataset. It contains 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, and 5 captions per image.
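For the captioning annotations, a minimal loading sketch with the pycocotools helper, assuming the standard 2017 caption annotation file has already been downloaded.

```python
# Sketch: read the captions attached to one COCO image with pycocotools.
from pycocotools.coco import COCO

coco = COCO("annotations/captions_train2017.json")  # path is an assumption
img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=img_id)
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])
```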
Cornell Natural Language for Visual Reasoning (NLVR and NLVR2)
Dataset contains two language grounding datasets containing natural language sentences grounded in images. The task is to determine whether a sentence is true about a visual input.
Dialogue Natural Language Inference (NLI)
Dataset used to improve the consistency of a dialogue model. It consists of sentence pairs labeled as entailment (E), neutral (N), or contradiction (C).
EmoBank
Dataset is a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme.
EmpatheticDialogues
Dataset of 25k conversations grounded in emotional situations.
Fact-based Visual Question Answering (FVQA)
Dataset contains image question answering triples.
HellaSwag
Dataset for studying grounded commonsense inference. It consists of 70k multiple choice questions about grounded situations: each question comes from one of two domains, ActivityNet or WikiHow, with four answer choices about what might happen next in the scene.
InsuranceQA
Dataset contains questions and answers collected from the website Insurance Library. It consists of questions from real-world users; the high-quality answers were composed by professionals with deep domain knowledge. There are 16,889 questions in total.
Irony Sarcasm Analysis Corpus
Dataset contains tweets in 4 subgroups: irony, sarcasm, regular and figurative. The tweets must be retrieved via the Twitter API; a hedged sketch of that step follows.
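The sketch below uses the tweepy v2 client; the bearer token and ID file are placeholders, and tweets deleted since release will simply be absent from the response.

```python
# Sketch: rehydrate released tweet IDs via the Twitter API (tweepy v2 client).
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder

with open("tweet_ids.txt") as f:                          # placeholder file
    ids = [line.strip() for line in f if line.strip()]

# The v2 lookup endpoint accepts at most 100 IDs per call.
for i in range(0, len(ids), 100):
    response = client.get_tweets(ids[i:i + 100])
    for tweet in response.data or []:
        print(tweet.id, tweet.text)
```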
OneCommon
Dataset contains 6,760 dialogues.
Physical IQA
Dataset is used for commonsense QA benchmark for naive physics reasoning focusing on how we interact with everyday objects in everyday situations. The dataset includes 20,000 QA pairs that are either multiple-choice or true/false questions.
QA-SRL Bank
Dataset contains question-answer pairs for 64,000 sentences, used to train models for semantic role labeling.
QA-ZRE
Dataset contains question-answer pairs, with each instance containing a relation, a question, a sentence, and an answer set.
ReVerb45k, Base and Ambiguous
Three datasets containing 91K triples in total.
Simplified Versions of the CommAI Navigation tasks (SCAN)
Dataset used for studying compositional learning and zero-shot generalization. SCAN consists of a set of commands and their corresponding action sequences.
Social IQA
Dataset used for a question-answering benchmark testing social commonsense intelligence.
Twitter Chat Corpus
Dataset contains Twitter question-answer pairs.
VisDial
Dataset contains images from the COCO training set paired with dialogues, intended for training models to answer questions about images in conversation. Contains 1.2M dialog question-answers.
WinoGrande
Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning.
Affective Text
Classification of emotions in 250 news headlines. Categories: anger, disgust, fear, joy, happiness, sadness, surprise.
Classify Emotional Relationships of Fictional Characters
Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters.
DailyDialog
A manually labelled conversation dataset. Categories: no emotion, anger, disgust, fear, happiness, sadness, surprise.
Dataset for Intent Classification and Out-of-Scope Prediction
Dataset is a benchmark for evaluating intent classification systems for dialog systems / chatbots in the presence of out-of-scope queries.
DiscoFuse
Dataset contains examples for training sentence fusion models. Sentence fusion is the task of joining several independent sentences into a single coherent text. The data has been collected from Wikipedia and from Sports articles.
Emotion-Stimulus
Dataset annotated with both the emotion and the stimulus using FrameNet's emotions-directed frame: 820 sentences with both cause and emotion, and 1,594 sentences marked with their emotion tag. Categories: happiness, sadness, anger, fear, surprise, disgust and shame.
Event-focused Emotion Corpora for German and English
German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources.
Event2Mind
Dataset contains 25,000 events and free-form descriptions of their intents and reactions.
IIT Bombay English-Hindi Corpus
Dataset contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources.
Paraphrase Adversaries from Word Scrambling (PAWS)
Dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification.
Relation Extraction Corpus
A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of "place of birth" and 40,000 examples of "attended or graduated from an institution."
Soccer Dialogues
Dataset contains soccer dialogues over a knowledge graph.
Social Media Mining for Health (SMM4H)
Dataset contains medication-related text classification and concept normalization data from Twitter.
Switchboard Dialogue Act Corpus (SwDA)
A subset of the Switchboard-1 corpus consisting of 1,155 conversations annotated with 42 dialogue act tags.
The Emotion in Text
Dataset of tweets labelled with emotion. Categories: empty, sadness, enthusiasm, neutral, worry, love, fun, hate, happiness, relief, boredom, surprise, anger.
A Conversational Question Answering Challenge (CoQA)
Dataset for measuring the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation.
A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (DROP)
Dataset is used to resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
ABC Australia News Corpus
Entire news corpus of ABC Australia from 2003 to 2019.
Activitynet-QA
Dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides a benchmark for testing the performance of VideoQA models on long-term spatio-temporal reasoning.
AI2 Reasoning Challenge (ARC)
Dataset contains 7,787 genuine grade-school level, multiple-choice science questions.
AI2 Science Questions Mercury
Dataset consists of questions used in student assessments across elementary and middle school grade levels. Includes questions both with and without diagrams.
AI2 Science Questions v2.1
Dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels. Each question is 4-way multiple choice format and may or may not include a diagram element.
Amazon Fine Food Reviews
Dataset consists of reviews of fine foods from Amazon.
Amazon Reviews
US product reviews from Amazon.
An Open Information Extraction Corpus (OPIEC)
OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia containing more than 341M triples.
AQuA
Dataset containing algebraic word problems with rationales for their answers.
Aristo Tuple KB
Dataset contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, and guided by domain vocabulary constraints.
arXiv Bulk Data
A collection of research papers on arXiv.
ASU Twitter Dataset
Twitter network data, not actual tweets. Shows connections between a large number of users.
Automated Essay Scoring
Dataset contains student-written essays with scores.
Automatic Keyphrase Extraction
Multiple datasets for automatic keyphrase extraction.
bAbI 20 Tasks
Dataset contains a set of contexts, with multiple question-answer pairs available based on the contexts.
bAbI 6 Tasks Dialogue
Dataset contains 6 tasks for testing end-to-end dialog systems in the restaurant domain.
BlogFeedback Dataset
Dataset to predict the number of comments a post will receive based on features of that post.
Blogger Authorship Corpus
Blog post entries of 19,320 people from blogger.com.
BoolQ
Question answering dataset for yes/no questions.
Buzz in Social Media Dataset
Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites.
Car Evaluation Dataset
Car properties and their overall acceptability.
Children's Book Test (CBT)
Dataset contains 'questions' created from book chapters by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query.
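A simplified sketch of that construction; note the real dataset restricts the removed word and the candidates to a single word class (e.g. named entities), which this sketch ignores.

```python
# Simplified CBT-style example builder: 20 context sentences, a cloze query
# from the 21st, and 10 candidates drawn from the context. The real dataset
# additionally restricts the blank and candidates to one word class.
import random

def make_cbt_example(sentences, rng=None):
    rng = rng or random.Random(0)
    assert len(sentences) >= 21
    context = sentences[:20]
    query_tokens = sentences[20].split()
    answer = rng.choice(query_tokens)
    query = " ".join("XXXXX" if t == answer else t for t in query_tokens)
    pool = sorted({t for s in context for t in s.split() if t != answer})
    candidates = [answer] + rng.sample(pool, 9)  # assumes >= 9 distinct words
    rng.shuffle(candidates)
    return {"context": context, "query": query,
            "candidates": candidates, "answer": answer}
```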
Choice of Plausible Alternatives (COPA)
Dataset used for open-domain commonsense causal reasoning.
Clinical Case Reports for Machine Reading Comprehension (CliCR)
Dataset was built from clinical case reports, requiring the reader to answer the query with a medical problem/test/treatment entity.
ClueWeb Corpora
Annotated web pages from the ClueWeb09 and ClueWeb12 corpora.
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI)
Dataset contains more than 23,500 sentence utterance videos from more than 1,000 online YouTube speakers. The dataset is gender balanced, and all sentence utterances are randomly chosen from various topics and monologue videos.
CNN / Daily Mail Dataset
Cloze-style reading comprehension dataset created from CNN and Daily Mail news articles.
Coached Conversational Preference Elicitation
Dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.
CommitmentBank
Dataset contains naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional).
COmmonsense Dataset Adversarially-authored by Humans (CODAH)
Commonsense QA in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions.
CommonsenseQA
Multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers.
ComplexWebQuestions
Dataset includes pairs of simple questions and their corresponding SPARQL queries. The SPARQL queries were taken from WEBQUESTIONSSP and automatically extended into more complex queries that include phenomena such as function composition, conjunctions, superlatives and comparatives.
Conceptual Captions
Dataset contains ~3.3M images annotated with captions to be used for the task of automatically producing a natural-language description for an image.
Conversational Text-to-SQL Systems (CoSQL)
Dataset consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k dialogues querying 200 complex databases spanning 138 domains. It is the dialogue version of the Spider and SParC tasks.
Cornell Movie-Dialogs Corpus
This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies, for a total of 304,713 utterances.
Cornell Newsroom
Dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017.
Corporate Messaging Corpus
Dataset contains statements classified as information, dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.).
Cosmos QA
Dataset containing thousands of problems that require commonsense-based reading comprehension, formulated as multiple-choice questions.
Dataset for Fill-in-the-Blank Humor
Dataset contains 50 fill-in-the-blank stories similar in style to Mad Libs. The blanks in these stories include the original word and the hint type (e.g. animal, food, noun, adverb).
Dataset for the Machine Comprehension of Text
Stories and associated questions for testing comprehension of text.
Deal or No Deal? End-to-End Learning for Negotiation Dialogues
This dataset consists of 5,808 dialogues, based on 2,236 unique scenarios dealing with negotiations and complex communication.
DEXTER Dataset
Task given is to determine, from features given, which articles are about corporate acquisitions.
DVQA
Dataset containing data visualizations and natural language questions.
Enron Email Dataset
Emails from employees at Enron organized into folders.
Examiner Pseudo-News Corpus
Clickbait, spam, crowd-sourced headlines from 2010 to 2015.
Explanations for Science Questions
Data contains: gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines.
GQA
Question answering on image scene graphs.
Hansards Canadian Parliament
Dataset contains pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament.
Harvard Library
Dataset contains books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
Hate Speech Identification Dataset
Dataset contains lexicons and notebooks with content that is racist, sexist, homophobic, and offensive in general.
Historical Newspapers Daily Word Time Series Dataset
Dataset contains daily contents of newspapers published in the US and UK from 1836 to 1922.
Home Depot Product Search Relevance
Dataset contains a number of products and real customer search terms from Home Depot's website.
HotpotQA
Dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
Human-in-the-loop Dialogue Simulator (HITL)
Dataset provides a framework for evaluating a bot's ability to learn to improve its performance in an online setting using feedback from its dialog partner. The dataset contains questions based on the bAbI and WikiMovies datasets, with the addition of feedback from the dialog partner.
Jeopardy Questions and Answers
Dataset contains Jeopardy questions, answers and other data.
Legal Case Reports
Federal Court of Australia cases from 2006 to 2009.
LibriSpeech ASR
Large-scale (1000 hours) corpus of read English speech.
Ling-Spam Dataset
Corpus contains both legitimate and spam emails.
Meta-Learning Wizard-of-Oz (MetaLWOz)
Dataset designed to help develop models capable of predicting user responses in unseen domains. It was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains.
Microsoft Information-Seeking Conversation (MISC) dataset
Dataset contains recordings of information-seeking conversations between human "seekers" and "intermediaries". It includes audio and video signals; transcripts of conversation; affectual and physiological signals; recordings of search and other computer use; and post-task surveys on emotion, success, and effort.
Microsoft Machine Reading COmprehension Dataset (MS MARCO)
Dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies.
Microsoft Research Paraphrase Corpus (MRPC)
Dataset contains pairs of sentences which have been extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.
Microsoft Research Social Media Conversation Corpus
A-B-A triples extracted from Twitter.
MovieLens
Dataset contains 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.
MovieTweetings
Movie rating dataset based on public and well-structured tweets.
MSParS
Dataset for the open domain semantic parsing task.
Multi-Domain Wizard-of-Oz Dataset (MultiWoz)
Dataset of human-human written conversations spanning over multiple domains and topics. The dataset was collected based on the Wizard of Oz experiment on Amazon MTurk.
Multimodal Comprehension of Cooking Recipes (RecipeQA)
Dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images.
MultiNLI Matched/Mismatched
Dataset contains sentence pairs annotated with textual entailment information.
MutualFriends
Task where two agents must discover which friend of theirs is mutual based on the friend's attributes.
NarrativeQA
Dataset contains the list of documents with Wikipedia summaries, links to full stories, and questions and answers.
Natural Questions (NQ)
Dataset contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question.
News Headlines Dataset for Sarcasm Detection
High quality dataset with Sarcastic and Non-sarcastic news headlines.
News Headlines Of India
Dataset contains an archive of notable events in India during 2001-2018, recorded by the Times of India.
NewsQA
Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN.
NPS Chat Corpus
Posts from age-specific online chat rooms.
NUS SMS Corpus
SMS messages collected between 2 users, with timing analysis.
NYSK Dataset
English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.
Open Research Corpus
Dataset contains over 39 million published research papers in Computer Science, Neuroscience, and Biomedical.
OpenBookQA
Dataset modeled after open book exams for assessing human understanding of a subject. It consists of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test), which probe the understanding of a small "book" of 1,326 core science facts and the application of these facts to novel situations.
OpenWebTextCorpus
Dataset contains text from millions of webpages sourced from Reddit URLs, totalling 38GB of text data.
OpinRank Review Dataset
Reviews of cars and hotels from Edmunds.com and TripAdvisor.
Paraphrase and Semantic Similarity in Twitter (PIT)
Dataset focuses on whether tweets carry (almost) the same meaning/information or not.
Personalized Dialog
Dataset of dialogs from movie scripts.
Plaintext Jokes
208,000 jokes in this database scraped from three sources.
ProPara Dataset
Dataset is used for comprehension of simple paragraphs describing processes, e.g., photosynthesis. The comprehension task relies on predicting, tracking, and answering questions about how entities change during the process.
QuaRel Dataset
Dataset contains 2,771 story questions about qualitative relationships.
QuaRTz Dataset
Dataset contains 3,864 questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs).
Quasar-S & T
The Quasar-S dataset consists of 37,000 cloze-style queries constructed from definitions of software entity tags on the popular website Stack Overflow. The Quasar-T dataset consists of 43,000 open-domain trivia questions and their answers obtained from various internet sources.
Question Answering in Context (QuAC)
Dataset for modeling, understanding, and participating in information seeking dialog.
Question NLI
Dataset converts the SQuAD dataset into sentence-pair classification by forming a pair between each question and each sentence in the corresponding context.
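A simplified sketch of that conversion, using naive sentence splitting and labeling a pair as entailment when the sentence contains the gold answer span; the released QNLI construction is more careful than this.

```python
# Simplified QNLI-style conversion: pair a SQuAD question with each context
# sentence; positive pairs are those containing the answer span.
def squad_to_pairs(question, context, answer_text):
    pairs = []
    for sentence in context.split(". "):       # naive sentence splitter
        label = "entailment" if answer_text in sentence else "not_entailment"
        pairs.append((question, sentence.strip(), label))
    return pairs

pairs = squad_to_pairs(
    "Where is the Eiffel Tower?",
    "The Eiffel Tower is in Paris. It was completed in 1889.",
    "Paris",
)
```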
Quora Question Pairs
The task is to determine whether a pair of questions are semantically equivalent.
ReAding Comprehension Dataset From Examinations (RACE)
Dataset was collected from English exams that evaluate students' ability in understanding and reasoning.
Reading Comprehension over Multiple Sentences (MultiRC)
Dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD)
Reading comprehension dataset which requires commonsense reasoning. Contains 120,000+ queries from 70,000+ news articles.
Reading Comprehension with Multiple Hops (Qangaroo)
Reading comprehension datasets focusing on multi-hop (alias multi-step) inference. There are 2 datasets: WikiHop (based on Wikipedia) and MedHop (based on PubMed research papers).
Recognizing Textual Entailment (RTE)
Datasets are combined and converted to two-class classification: entailment and not_entailment.
Reddit All Comments Corpus
All Reddit comments (as of 2017).
Relationship and Entity Extraction Evaluation Dataset (RE3D)
Entity and Relation marked data from various news and government sources.
Reuters-21578 Benchmark Corpus
Dataset is a collection of 10,788 documents from the Reuters financial newswire service, partitioned into a training set with 7,769 documents and a test set with 3,019 documents.
Schema-Guided Dialogue State Tracking (DSTC 8)
Dataset contains 18K dialogues between a virtual assistant and a user.
SciQ Dataset
Dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each.
SciTail Dataset
Dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.
SearchQA
Dataset from Jeopardy! archives consisting of more than 140k question-answer pairs, with each pair having 49.6 snippets on average.
Semantic Parsing in Context (SParC)
Dataset consists of 4,298 coherent question sequences (12k+ unique individual questions annotated with SQL queries). It is the context-dependent/multi-turn version of the Spider task.
Semantic Textual Similarity Benchmark
The task is to predict textual similarity between sentence pairs.
SemEvalCQA
Dataset for community question answering.
Sentiment Labeled Sentences Dataset
Dataset contains 3,000 sentiment labeled sentences.
Sentiment140
Tweet data from 2009 including original text, time stamp, user and sentiment.
Shaping Answers with Rules through Conversation (ShARC)
ShARC is a Conversational Question Answering dataset focussing on question answering from texts containing rules.
Short Answer Scoring
Student-written short-answer responses.
Situations With Adversarial Generations (SWAG)
Dataset consists of 113k multiple choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene.
Skytrax User Reviews Dataset
User reviews of airlines, airports, seats, and lounges from Skytrax.
SMS Spam Collection Dataset
Dataset contains SMS spam messages.
SNAP Social Circles: Twitter Database
Large Twitter network data.
Social-IQ Dataset
Dataset containing videos and natural language questions for visual reasoning.
Spambase Dataset
Dataset contains spam emails.
Spider 1.0
Dataset consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.
SQuAD v2.0
Paragraphs with questions and answers.
Stack Overflow BigQuery Dataset
BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges.
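A sketch of querying the archive with the google-cloud-bigquery client; the table name below is the public dataset's, but a configured Google Cloud project with billing is assumed.

```python
# Sketch: pull the top-scoring NLP-tagged questions from the public archive.
from google.cloud import bigquery

client = bigquery.Client()  # assumes GCP credentials are configured
query = """
    SELECT title, score
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE tags LIKE '%nlp%'
    ORDER BY score DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.score, row.title)
```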
Stanford Natural Language Inference (SNLI) Corpus
Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs.
T-REx
Dataset contains Wikipedia abstracts aligned with Wikidata entities.
TabFact
Dataset contains 16k Wikipedia tables as evidence for 118k human annotated statements to study fact verification with semi-structured evidence.
Taskmaster-1
Dataset contains 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.
Textbook Question Answering
The M3C task builds on the popular Visual Question Answering (VQA) and Machine Comprehension (MC) paradigms by framing question answering as a machine comprehension task, where the context needed to answer questions is provided and composed of both text and images.
TextVQA
TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions.
The Benchmark of Linguistic Minimal Pairs (BLiMP)
BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English.
The Conversational Intelligence Challenge 2 (ConvAI2)
A chit-chat dataset based on PersonaChat dataset.
The Corpus of Linguistic Acceptability (CoLa)
Dataset used to classify sentences as grammatical or not grammatical.
The Dialog-based Language Learning Dataset
Dataset was designed to measure how well models can perform at learning as a student given a teacher's textual responses to the student's answer.
The Irish Times IRS
Dataset contains 23 years of events from Ireland.
The Movie Dialog Dataset
Dataset measures how well models can perform at goal and non-goal oriented dialogue centered around the topic of movies (question answering, recommendation and discussion).
The Penn Treebank Project
Naturally occurring text annotated for linguistic structure.
The SimpleQuestions Dataset
Dataset for question answering with human generated questions paired with a corresponding fact, formatted as (subject, relationship, object), that provides the answer but also a complete explanation.
The Stanford Sentiment Treebank (SST)
Sentence sentiment classification of movie reviews.
The Story Cloze Test | ROCStories
Dataset for story understanding that provides systems with four-sentence stories and two possible endings. The systems must then choose the correct ending to the story.
The WikiMovies Dataset
Dataset contains only the QA part of the Movie Dialog dataset, but using three different settings of knowledge: using a traditional knowledge base (KB), using Wikipedia as the source of knowledge, or using IE (information extraction) over Wikipedia.
Topical-Chat
A knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don't have explicitly defined roles.
Total-Text-Dataset
Dataset used for detecting and recognizing curved text in images.
TrecQA
Dataset is commonly used for evaluating answer selection in question answering.
TriviaQA
Dataset containing over 650K question-answer-evidence triples. It includes 95K QA pairs authored by trivia enthusiasts and independently gathered evidence documents, 6 per question on average.
TupleInf Open IE Dataset
Dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T).
Twenty Newsgroups Dataset
Dataset is a collection of newsgroup documents used for classification tasks.
Twitter US Airline Sentiment
Dataset contains airline-related tweets that were labeled with positive, negative, and neutral sentiment.
Twitter100k
Pairs of images and tweets.
Ubuntu Dialogue Corpus
Dialogues extracted from Ubuntu chat stream on IRC.
Urban Dictionary Dataset
Corpus of words, votes and definitions.
UseNet Corpus
UseNet forum postings.
Visual Commonsense Reasoning (VCR)
Dataset contains 290K multiple-choice questions on 110K images.
Visual QA (VQA)
Dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense to answer.
Voices Obscured in Complex Environmental Settings (VOiCES)
Dataset contains a total of 15 hours (3,903 audio files) in male and female read speech.
Web of Science Dataset
Hierarchical Datasets for Text Classification.
WebQuestions Semantic Parses Dataset
Dataset contains full semantic parses in SPARQL queries for 4,737 questions, and "partial" annotations for the remaining 1,073 questions for which a valid parse could not be formulated or where the question itself is bad or needs a descriptive answer.
Who Did What Dataset
Dataset contains over 200,000 fill-in-the-gap (cloze) multiple choice reading comprehension problems constructed from the LDC English Gigaword newswire corpus.
WikiHow
Dataset contains article and summary pairs extracted and constructed from an online knowledge base written by different human authors.
WikiLinks
Dataset contains 40 million mentions over 3 million entities based on hyperlinks from Wikipedia.
WikiQA Corpus
Dataset contains Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer. 
Winogender Schemas
Dataset with pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
Words in Context
Dataset for evaluating contextualized word representations.
Yahoo! Music User Ratings of Musical Artists
Over 10M ratings of artists by Yahoo users. May be used to validate recommender systems or collaborative filtering algorithms.
Yelp Open Dataset
Dataset containing millions of reviews on Yelp. In addition it contains business data including location data, attributes, and categories.
YouTube Comedy Slam Preference Dataset
User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.
HeadQA
Dataset is a multichoice testbed of graduate-level questions about medicine, nursing, biology, chemistry, psychology, and pharmacology.
Open Table-and-Text Question Answering (OTT-QA)
Dataset contains open questions which require retrieving tables and text from the web to answer. The dataset is built on the HybridQA dataset.
Taskmaster-3
Dataset consists of 23,757 movie ticketing dialogs. "Movie ticketing" is defined as conversations where the customer's goal is to purchase tickets after deciding on theater, time, movie name, number of tickets, and date, or opt out of the transaction.
STAR
A schema-guided task oriented dialog dataset consisting of 127,833 utterances and knowledge base queries across 5,820 task-oriented dialogs in 13 domains that is especially designed to facilitate task and domain transfer learning in task-oriented dialog.
PheMT
Dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant.
EHR-Rel
A benchmark dataset for biomedical concept relatedness, consisting of 3,630 concept pairs sampled from electronic health records (EHRs).
2WikiMultihopQA
A multihop QA dataset, which uses structured and unstructured data. It includes the evidence information containing a reasoning path for multi-hop questions.
CoNLL 2003 ++
Similar to the original CoNLL 2003 except that the test set has been corrected for label mistakes. The dataset is split into training, development, and test sets, with 14,041, 3,250, and 3,453 instances respectively.
Open-Retrieval Conversational Question Answering (ORConvQA)
Dataset enhances QuAC by adapting it to an open retrieval setting. It is an aggregation of 3 existing datasets: (1) the QuAC dataset that offers information-seeking conversations, (2) the CANARD dataset that consists of context-independent rewrites of QuAC questions, and (3) the Wikipedia corpus that serves as the knowledge source of answering questions.
KB-Ref
Dataset is a referring expression comprehension dataset containing 43K expressions on 16K images. Unlike other referring expression datasets, it requires that each referring expression use at least one piece of external knowledge (information that cannot be obtained from the image).
ACL Citation Coreference Corpus
Dataset was constructed from papers from proceedings of the ACL conference in 2007 and 2008. Text was annotated for the coreference resolution task.
COMETA
Dataset is an entity linking dataset of layman medical terminology. It consists of 20K English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph.
WNUT 2016
Dataset is annotated with 10 fine-grained NER categories: person, geo-location, company, facility, product, music artist, movie, sports team, TV show and other. Dataset was extracted from tweets and is structured in CoNLL format.
GrailQA
Dataset contains 64,331 crowdsourced questions involving up to 4 relations and functions like counting, comparatives, and superlatives. The dataset covers all the 86 domains in Freebase Commons.
ENT-DESC
Dataset was extracted from Wikipedia and Wikidata and contains over 110k instances. Each sample is a triplet containing a set of entities, the explored knowledge from a KG, and the description.
Social Bias Inference Corpus (SBIC) 
Dataset contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
Social Narrative Tree
Dataset contains 1,250 stories documenting a variety of daily social interactions.
Multi-Xscience
A multi-document summarization dataset created from scientific articles. MultiXScience introduces a challenging multidocument summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
Corpus for Knowledge-Enhanced Language Model Pre-training (KELM)
Dataset consists of ~18M sentences spanning ~45M triples with ~1,500 distinct relations from English Wikidata.
TriageSQL
Dataset is a cross-domain text-to-SQL question intention classification benchmark. It contains 34K databases and 390K questions from 20 existing datasets.
TweetEval
TweetEval consists of seven tasks on Twitter data, all framed as multi-class tweet classification: emotion recognition, emoji prediction, irony detection, hate speech detection, offensive language identification, sentiment analysis, and stance detection.
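A minimal loading sketch, assuming the Hugging Face `tweet_eval` mirror, where each task is a separate config:

```python
# Sketch: load one TweetEval task from the Hugging Face Hub.
from datasets import load_dataset

irony = load_dataset("tweet_eval", "irony")  # other configs: "emotion", ...
print(irony["train"][0])  # e.g. {'text': '...', 'label': 1}
```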
Tree-Based Dialog State Tracking (TreeDST)
Dataset is a multi-turn, multi-domain task-oriented dialog dataset annotated with tree-based user dialog states and system dialog acts. The goal is to provide a novel solution for end-to-end dialog state tracking as a conversational semantic parsing task. In total, it contains 27,280 conversations covering 10 domains with shared types of person, time and location.
Acronym Detection Dataset
Dataset contains 62,441 samples where each sample involves a sentence, an ambiguous acronym, and its correct meaning. Samples came from scientific papers from arXiv.
Acronym Identification
Task is to find the acronyms and the phrases that have been abbreviated by the acronyms in the document.
CC100-English
This dataset is one of the 100 corpora of monolingual data processed from the January-December 2018 Common Crawl snapshots via the CC-Net repository. The size of this corpus is 82GB.
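A minimal loading sketch, assuming the Hugging Face `cc100` mirror; streaming avoids downloading the full 82GB shard:

```python
# Sketch: stream the English CC-100 shard instead of downloading all of it.
from datasets import load_dataset

cc100_en = load_dataset("cc100", lang="en", split="train", streaming=True)
for i, example in enumerate(cc100_en):
    print(example["text"])
    if i == 2:
        break
```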
Scruples
Scruples contains 2 datasets: Anecdotes and Dilemmas. Anecdotes contains 32,000 real-life anecdotes about complex ethical situations, with 625,000 ethical judgments extracted from reddit. Dilemmas contains 10,000 ethical dilemmas in the form of paired actions, where the model must identify which one was considered less ethical by crowd workers on Mechanical Turk.
Elsevier OA CC-BY
Dataset contains 40,091 open access (OA) CC-BY articles from across Elsevier's journals.
Microsoft News Dataset (MIND)
Dataset contains ~160k English news articles and more than 15 million impression logs generated by 1 million users.
Situated and Interactive Multimodal Conversations (SIMMC)
There are 2 datasets totalling ~13K human-human dialogs (~169K utterances) using a multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture (grounded in a shared virtual environment) and (b) fashion (grounded in an evolving set of images).
HINT3
In total there are three datasets: SOFMattress, Curekart and Powerplay11, each containing a diverse set of intents in a single domain - mattress products retail, fitness supplements retail and online gaming respectively. Each dataset spans multiple coarse and fine-grained intents, with the test sets drawn entirely from actual user queries on live systems at scale rather than being crowdsourced.
NewSHead
Dataset contains 369,940 English stories with 932,571 unique URLs, of which 359,940 stories are for training, 5,000 for validation, and 5,000 for testing. Each news story contains at least three (and up to five) articles.
NatCat
Dataset contains naturally annotated category-text pairs for training text classifiers derived from 3 sources: Wikipedia, Reddit, and Stack Exchange.
DialoGLUE
Benchmark for task-oriented dialogue containing 7 datasets: Banking77 (online banking queries), HWU64 (popular personal assistant queries), CLINC150 (popular personal assistant queries), Restaurant8k (restaurant booking domain queries), DSTC8 SGD (multi-domain, task-oriented conversations between a human and a virtual assistant), TOP (compositional queries for hierarchical semantic representations), and MultiWOZ 2.1 (12K multi-domain dialogues with multiple turns).
Offensive Language Identification Dataset (OLID)
Dataset contains a collection of 14,200 annotated English tweets using an annotation model that encompasses three levels: offensive language detection, categorization of offensive language, and offensive language target identification.
Business Scene Dialogue (BSD)
Dataset contains 955 scenarios and 30,000 parallel sentences in English-Japanese.
English Possible Idiomatic Expressions (EPIE)
Dataset containing 25,206 sentences labelled with lexical instances of 717 idiomatic expressions.
SFU Opinion and Comments Corpus (SOCC)
Dataset contains 10,339 opinion articles (editorials, columns, and op-eds) together with their 663,173 comments from 303,665 comment threads, from the main Canadian daily in English, The Globe and Mail, from January 2012 to December 2016. In addition, there is an annotated subset measuring toxicity, negation and its scope, and appraisal, containing 1,043 annotated comments in response to 10 different articles covering a variety of subjects: technology, immigration, terrorism, politics, budget, social issues, religion, property, and refugees.
Inquisitive
Dataset contains ~19K questions that are elicited while a person is reading through a document. Compared to existing datasets, INQUISITIVE questions target higher-level (semantic and discourse) comprehension of text.
Constructive Comments Corpus (C3)
Dataset is a subset of comments from the SFU Opinion and Comments Corpus. This subset, the Constructive Comments Corpus (C3) consists of 12,000 comments annotated by crowdworkers.
Numeric Fused-Heads
Dataset contains annotated sentences of numeric fused heads, along with their "missing head". In a numeric fused head, a number refers to an implicit reference that is not explicitly provided. For example, in the sentence "I miss being 10", the number 10 refers to the age of 10, but "age" is never said.
LEDGAR
LEDGAR is a multilabel corpus of legal provisions in contracts suited for text classification in the legal domain (legaltech). It features over 1.8M provisions and a set of 180K+ labels. A smaller, cleaned version of the corpus is also available.
OLPBench
Dataset contains 30M open triples, 1M distinct open relations and 2.5M distinct mentions of approximately 800K entities. Dataset is used for the open link prediction task.
FewRel 1.0
Dataset is a few-shot relation extraction dataset, which contains more than one hundred relations and tens of thousands of annotated instances across different domains.
SemEval2010 Task 8
Dataset consists of 8,000 sentences annotated for Cause-Effect, Instrument-Agency, Product-Producer, Content-Container, Entity-Origin, Entity-Destination, Component-Whole, Member-Collection, and Message-Topic.
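Each sentence marks the two nominals with <e1>/<e2> tags and carries a directed relation label; a minimal parsing sketch (the sentence below is invented, though the tag format follows the task definition):

# Hypothetical SemEval-2010 Task 8 style example: tagged nominals plus a
# directed relation label between them.
import re

sentence = "The <e1>author</e1> wrote the program with a <e2>compiler</e2>."
label = "Instrument-Agency(e2,e1)"

e1 = re.search(r"<e1>(.*?)</e1>", sentence).group(1)
e2 = re.search(r"<e2>(.*?)</e2>", sentence).group(1)
print(e1, e2, label)  # author compiler Instrument-Agency(e2,e1)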
KnowledgeNet
KnowledgeNet is a benchmark dataset for the task of automatically populating a knowledge base (Wikidata) with facts expressed in natural language text on the web.
Evidence Inference
Dataset contains 10,137 annotated prompts for 2,419 unique articles, with the task of inferring whether a given clinical treatment is effective with respect to a specified outcome. Each prompt specifies an intervention, a comparator, and an outcome, along with a full-text article. The model is then used to infer the reported findings with respect to this prompt.
DailyDialog++
DailyDialog++ is an open-domain dialogue evaluation dataset consisting of 19k contexts with five relevant responses for each context. Additionally for 11k contexts, it includes five adversarial irrelevant responses which are specifically crafted to have lexical or semantic overlap with the context but are still unacceptable as valid responses.
LogiQA
Dataset consists of 8,678 QA instances, covering multiple types of deductive reasoning. Multiple-choice.
SCDE
Dataset of sentence-level cloze questions sourced from public school examinations. Each instance consists of a passage with multiple sentence-level blanks and a shared set of candidates. Besides the right answer to each cloze in the passage, the candidate set also contains ones which don't answer any cloze, called distractors. [requires contacting authors for data]
CoDEx 
Three graph datasets containing positive and hard negative triples, entity types, entity and relation descriptions, and Wikipedia page extracts for entities.
QED
Given a question and a passage, QED represents an explanation of the answer as a combination of discrete, human-interpretable steps: sentence selection, referential equality, and predicate entailment. Dataset was built as a subset of the Natural Questions dataset.
SMCalFlow
Dataset contains natural conversations about tasks involving calendars, weather, places, and people. Each turn is annotated with an executable dataflow program featuring API calls, function composition, and complex constraints built from strings, numbers, dates and times.
Critical Role Dungeons and Dragons Dataset (CRD3)
Dataset is collected from 159 Critical Role episodes transcribed to text dialogues, consisting of 398,682 turns. It also includes corresponding abstractive summaries collected from the Fandom wiki. Critical Role is an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons.
The Semantic Scholar Open Research Corpus (S2ORC)
Dataset contains 136M+ paper nodes with 12.7M+ full text papers and connected by 467M+ citation edges.
Wiki-CS
Dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field.
Semantic Parsing with Language Assistance from Humans (SPLASH)
Dataset enables text-to-SQL systems to seek and leverage human feedback to further improve the overall performance and user experience. Dataset contains 9,314 question-feedback pairs: 8,352 correspond to questions in the Spider training split and 962 to the Spider development split.
ClariQ
Dataset consists of single-turn conversations (initial_request, followed by clarifying question and answer). In addition, it comes with synthetic multi-turn conversations (up to three turns). ClariQ features approximately 18K single-turn conversations, as well as 1.8 million multi-turn conversations.
DialogRE
Dataset contains human-annotated dialogue-based relation extraction data comprising 1,788 dialogues originating from the complete transcripts of the American sitcom Friends. There are 36 possible relation types that can exist between an argument pair in a dialogue.
Visual Genome
Dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects.
CraigsListBargain
Dataset contains 6,682 human-human dialogues where 2 agents negotiate the sale/purchase of an item.
A Multi-Turn, Multi-Domain Dialogue Dataset (KVRET)
Dataset contains 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation.
CMU_ARCTIC
Dataset contains 1,150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
L2-ARCTIC
Dataset includes recordings from twenty-four (24) non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, Arabic and Vietnamese, each L1 containing recordings from two male and two female speakers. Each speaker recorded approximately one hour of read speech from CMU's ARCTIC prompts.
ACL Anthology Reference Corpus (ACL ARC)
Dataset contains 10,921 articles from the February 2007 snapshot of the Anthology; text and metadata for the articles were extracted, consisting of BibTeX records derived either from the headers of each paper or from metadata taken from the Anthology website.
NCLS-Corpora
Contains two datasets for cross-lingual summarization: ZH2ENSUM and EN2ZHSUM. There are 370,759 English-to-Chinese cross-lingual summarization (CLS) pairs and 1,699,713 Chinese-to-English CLS pairs.
TalkDown
Dataset used for classifying condescending acts in context. Dataset was extracted from Reddit COMMENT and REPLY pairs in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending.
Implicature and Presupposition Diagnostic dataset (IMPPRES)
Dataset contains semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types. IMPPRES follows the format of SNLI, MultiNLI and XNLI, and was created to evaluate how well trained NLI models recognize several classes of presuppositions and scalar implicatures.
Abstractive Sentence Simplification Evaluation and Tuning (ASSET)
Dataset consists of 23,590 human simplifications associated with the 2,359 original sentences from TurkCorpus (10 simplifications per original sentence).
Visual Commonsense Graphs
Dataset consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 59,000 images, each paired with short video summaries of before and after.
Story Commonsense
Dataset contains a total of 300k low-level annotations for motivation and emotion across 15,000 stories (randomly selected from the ROC story training set). It covers over 150,000 character-line pairs, in which 56k character-line pairs have an annotated motivation and 105k have an annotated change in emotion (i.e. a label other than none).
StereoSet
Dataset that measures stereotype bias in language models. StereoSet consists of 17,000 sentences that measure model preferences across gender, race, religion, and profession.
Hippocorpus
Dataset of 6,854 English diary-like short stories about recalled and imagined events.
Web Demonstration and Explanation Dataset (Web-D-E)
Dataset consists of 520 explanations and corresponding demonstrations of web-based tasks from the Mini World of Bits.
GeNeVA
Data contains the CoDraw and i-CLEVR datasets used for the Generative Neural Visual Artist (GeNeVA) task.
FigureQA
Dataset is a visual reasoning corpus of over one million question answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts.
BioCreative II Gene Mention Recognition (BC2GM)
Dataset contains data where participants are asked to identify a gene mention in a sentence by giving its start and end characters. The training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations). [registration required for access]
BC5CDR Drug/Chemical (BC5-Chem)
Dataset consists of three separate sets of articles with chemicals and their relations annotated. [registration required for access]
BC5CDR Disease (BC5-Disease)
Dataset consists of three separate sets of articles with diseases and their relations annotated. [registration required for access]
JNLPBA
The BioNLP / JNLPBA Shared Task 2004 involves the identification and classification of technical terms referring to concepts of interest to biologists in the domain of molecular biology. 
NCBI Disease Corpus
Dataset contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier.
EBM PICO
Dataset contains ~5,000 medical abstracts describing clinical trials, annotated in detail with respect to characteristics of the underlying trial Populations (e.g., diabetics), Interventions (insulin), Comparators (placebo) and Outcomes (blood glucose levels).
ChemProt
ChemProt [is] a disease chemical biology database, which is based on a compilation of multiple chemicalā€“protein annotation resources, as well as disease-associated proteinā€“protein interactions (PPIs). [registration required for access]
Drug-Drug Interaction (DDI)
Dataset contains 792 texts selected from the DrugBank database and 233 Medline abstracts. This fine-grained corpus has been annotated with a total of 18,502 pharmacological substances and 5,028 DDIs, including both PK and PD interactions.
Gene-Disease Associations (GAD)
Dataset is an archive of published genetic association studies that provides a comprehensive, public, web-based repository of molecular, clinical and study parameters for >5,000 human genetic association studies at this time.
BIOSSES
Dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. TAC dataset consists of 20 articles (reference articles) and citing articles that vary from 12 to 20 for each of the reference articles.
HoC (Hallmarks of Cancer)
Dataset consists of 1,852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy.
PubmedQA
A biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe.
TRACT: Tweets Reporting Abuse Classification Task Corpus
Dataset used for a multi-class classification task involving three classes of tweets that mention reports of abuse: "report" (annotated as 1); "empathy" (annotated as 2); and "general" (annotated as 3).
Hateful Memes
Dataset is used to detect hateful memes. In total, the dataset contains 10,000 memes comprising five different types: multimodal hate, where benign confounders were found for both modalities; unimodal hate, where one or both modalities were already hateful on their own; benign image confounders; benign text confounders; and finally random not-hateful examples.
Adverse Drug Effect (ADE) Corpus
There are 3 different datasets: DRUG-AE.rel provides relations between drugs and adverse effects, DRUG-DOSE.rel provides relations between drugs and dosages, and ADE-NEG.txt provides all sentences in the ADE corpus that do NOT contain any drug-related adverse effects.
MEDIQA-Answer Summarization
Dataset containing question-driven summaries of answers to consumer health questions.
NEJM-enzh
Dataset is an English-Chinese parallel corpus, consisting of about 100,000 sentence pairs and 3,000,000 tokens on each side, from the New England Journal of Medicine (NEJM).
Wikipedia Current Events Portal (WCEP) Dataset
Dataset is used for multi-document summarization (MDS) and consists of short, human-written summaries about news events, obtained from the Wikipedia Current Events Portal (WCEP), each paired with a cluster of news articles associated with an event.
Worldtree Corpus
Dataset contains multi-hop question answering/explanations where questions require combining between 1 and 16 facts (average 6) to generate detailed explanations for question answering inference. Each explanation is represented as a lexically-connected "explanation graph" that combines an average of 6 facts drawn from a semi-structured knowledge base of 9,216 facts across 66 tables.
ScienceExamCER
Dataset contains 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms.
LibriMix
Dataset is used for speech source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it.
TVQA
Dataset is used for video question answering and consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video.
MovieQA
Dataset used to evaluate automatic story comprehension from both video and text. The dataset consists of almost 15,000 multiple-choice questions obtained from over 400 movies.
TGIF-QA
Dataset consists of 165K QA pairs from 72K animated GIFs. Used for video question answering.
Tumblr GIF (TGIF)
Dataset contains 100K animated GIFs and 120K sentences describing visual content of the animated GIFs.
ArxivPapers
Dataset is a corpus of over 100,000 scientific papers related to machine learning.
SegmentedTables & LinkedResults
Dataset annotates mentions in captions, the type of each table (leaderboard, ablation, irrelevant), and ground-truth cell annotations into classes: dataset, metric, paper model, cited model, meta and task.
CodeSearchNet Corpus
Dataset contains functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub.
CompGuessWhat?!
Dataset contains 65,700 dialogues based on GuessWhat?! dataset dialogues and enhanced by including object attributes coming from resources such as VISA attributes, VisualGenome and ImSitu.
Gigaword
Dataset is used for headline generation on a corpus of around 4 million article-headline pairs from Gigaword.
Opinosis
Dataset contains sentences extracted from reviews for 51 topics. Topics and opinions are obtained from Tripadvisor, Edmunds.com and Amazon.com.
BillSum
Dataset is used for summarization of US Congressional and California state bills.
SAMSum
Dataset contains over 16K chat dialogues with manually annotated summaries.
Annotated Enron Subject Line Corpus (AESLC)
Dataset contains email messages of employees in the Enron Corporation.
Multi-News
Dataset consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.
News Category Dataset
Dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost.
NIPS Papers
Dataset contains the title, authors, abstracts, and extracted text for all NIPS papers from 1987 to 2016.
CSTR VCTK Corpus
Dataset contains speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent.
Open Resource for Click Analysis in Search (ORCAS)
ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
ClarQ
Dataset consists of ~2M question/post tuples distributed across 173 domains of Stack Exchange.
ManyModalQA
Dataset contains 10,190 questions, 2,873 images, 3,789 texts, and 3,528 tables scraped from Wikipedia.
POLUSA
Dataset contains 0.9M articles covering policy topics published between Jan. 2017 and Aug. 2019 by 18 news outlets representing the political spectrum.
DocBank
Dataset contains fine-grained token-level annotations for document layout analysis. It includes 5,053 documents; the validation set and the test set each include 100 documents.
Get it #OffMyChest
Dataset is used for affective understanding of conversations, focusing on how speakers use emotions to react to a situation and to each other. Posts were taken from the top 2018 Reddit posts on /r/CasualConversations and /r/OffMyChest.
AirDialogue
Dataset contains 402,038 goal-oriented conversations.
WSJ0 Hipster Ambient Mixtures (WHAM!)
Dataset consists of two speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area.
Crema-D
Dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were collected.
LibriTTS
Dataset is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate.
LJ Speech
Dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Surrey Audio-Visual Expressed Emotion (SAVEE)
Dataset consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically-balanced for each emotion.
Civil Comments
Dataset contains the archive of the Civil Comments platform. Dataset was annotated for toxicity.
Common Sense Explanations (CoS-E)
Dataset used to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework.
e-SNLI
Dataset contains human-annotated natural language explanations of the entailment relations.
1 Billion Word Language Model Benchmark (lm1b)
Dataset used for measuring progress in statistical language modeling.
Math Dataset
Dataset contains mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
SciCite
Dataset used for classifying citation intents in academic papers. The main citation intent for each JSON object is specified with the label key, while the citation context is specified with the context key.
WordNet
Dataset is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
Yelp Polarity Reviews
Dataset contains 1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity. Dataset from FastAI's website.
Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong
Dataset contains aligned sentence pairs from bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislations and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others.
MMD
Dataset contains over 150K conversation sessions between shoppers and sales agents.
ParaBank
Dataset contains paraphrases with 79.5 million references and on average 4 paraphrases per reference.
Humicroedit
Dataset contains 15,095 edited news headlines and their numerically assessed humor.
VQA-Introspect
Dataset consists of 238K new perception questions from the VQA dataset which serve as sub questions corresponding to the set of perceptual tasks needed to answer complex reasoning questions.
Talk the Walk
Dataset consists of over 10k crowd-sourced dialogues in which two human annotators collaborate to navigate to target locations in the virtual streets of NYC.
FB15K-237 Knowledge Base Completion Dataset
Dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs.
WN18RR
Dataset contains knowledge base relation triples from WordNet.
AmbigNQ
Dataset covering 14,042 questions from NQ-open, an existing open-domain QA benchmark.
OpenKeyPhrase (OpenKP)
Open domain keyphrase extraction dataset containing 148,124 real world web documents along with a human annotation indicating the 1-3 most relevant keyphrases.
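An illustrative record might look like the sketch below; the field names are hypothetical, not the exact release schema, but the keyphrases are extractive spans of the document text as described above:

# Illustrative OpenKP-style record; field names are hypothetical.
doc = {
    "url": "https://example.com/article",
    "text": "OpenKP is an open-domain keyphrase extraction benchmark built from real web pages.",
    "keyphrases": ["keyphrase extraction", "open-domain"],
}
for kp in doc["keyphrases"]:
    # Gold keyphrases are spans that actually occur in the document text.
    assert kp.lower() in doc["text"].lower()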
Flickr30K Entities
Dataset contains 244k coreference chains and 276k manually annotated bounding boxes for each of the 31,783 images and 158,915 English captions (five per image) in the original dataset.
Street View Text (SVT)
Dataset contains images with textual content used for scene text recognition.
WikiBio
Dataset contains 728,321 biographies from Wikipedia. For each article, it provides the first paragraph and the infobox (both tokenized).
Rotowire and SBNation Datasets
Dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box and line scores.
E2E
Dataset contains 50k dialogue-act-based meaning representations in the restaurant domain, each paired with 8.1 natural-language references on average.
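The meaning representations are flat lists of attribute[value] slots paired with free-text references; a small parsing sketch (the MR below mimics the published format, but this specific pair is invented):

# Parse a hypothetical E2E-style meaning representation into a dict.
mr = "name[The Eagle], eatType[coffee shop], food[French], area[riverside]"
slots = dict(
    part.strip().rstrip("]").split("[", 1)  # "name[The Eagle" -> ("name", "The Eagle")
    for part in mr.split("],")
)
reference = "The Eagle is a French coffee shop by the riverside."
print(slots)  # {'name': 'The Eagle', 'eatType': 'coffee shop', ...}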
LogicNLG
Dataset is a table-based fact-checking dataset with rich logical inferences in the annotated statements.
FakeNewsNet
Repo contains two datasets with news content, social context, and spatiotemporal information from Politifact and Gossipcop.
LIAR Dataset
Dataset contains 12.8K manually labeled short statements in various contexts from POLITIFACT.COM, which provides detailed analysis report and links to source documents for each case.
Dialogue-Based Reading Comprehension Examination (DREAM)
Dataset contains 10,197 multiple choice questions for 6,444 dialogues, collected from English-as-a-foreign-language examinations designed by human experts. DREAM is likely to present significant challenges for existing reading comprehension systems: 84% of answers are non-extractive, 85% of questions require reasoning beyond a single sentence, and 34% of questions also involve commonsense knowledge.
The New York Times Annotated Corpus
Dataset contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom.
BigPatent
Dataset consists of 1.3 million records of U.S. patent documents along with human written abstractive summaries.
Libri-Light
Dataset contains 60K hours of unlabelled speech from audiobooks in English and a small labelled data set (10h, 1h, and 10 min).
Atlas of Machine Commonsense (ATOMIC)
Dataset is a knowledge graph of 877K textual description triples of inferential knowledge.
Genia
Dataset contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms "human", "blood cells", and "transcription factors". The corpus has been annotated for part-of-speech, constituency syntax, terms, events, relations, and coreference.
DNA Methylation Corpus
Dataset contains 200 abstracts including a representative sample of all PubMed citations relevant to DNA methylation, with manual annotations for nearly 3,000 gene/protein mentions and 1,500 DNA methylation and demethylation events.
Exhaustive PTM Corpus
Dataset contains 360 abstracts manually annotated in the BioNLP Shared Task event representation for over 4,500 mentions of proteins and 1,000 statements of modification events of nearly 40 different types.
mTOR Pathway Corpus
Dataset contains 1,300 annotated event instances of protein associations and dissociation reactions.
PTM Event Corpus
Dataset contains 157 PubMed abstracts annotated for over 1,000 proteins and 400 post-translational modification events identifying the modified proteins and sites.
T4SS Event Corpus
Dataset contains 27 full text publications totaling 15,143 pseudo-sentences (text sentences plus table rows, references, etc.) and 244,942 tokens covering 4 classes: Bacteria, Cellular components, Biological Processes, and Molecular functions.
Abstract Meaning Representation (AMR) Bank
Dataset contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text.
Adversarial NLI (ANLI)
Dataset is an NLI benchmark created via human-and-model-in-the-loop enabled training (HAMLET). Humans were tasked with providing hypotheses that fool the model into misclassifying the label.
SCITLDR
Dataset combining TLDRs written by human experts with author-written TLDRs of computer science papers from OpenReview.
Self-Annotated Reddit Corpus (SARC)
Dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. It contains statements, along with their responses as well as many non-sarcastic comments from the same source.
Fact Extraction and Verification (FEVER)
Dataset contains 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted, or NotEnoughInfo.
DuoRC
Dataset contains 186,089 unique question-answer pairs created from a collection of 7,680 pairs of movie plots where each pair in the collection reflects two versions of the same movie.
ColBERT
Dataset contains 200k short texts (100k positive, 100k negative). Used for humor detection.
PARANMT-50M
Dataset containing more than 50 million English-English sentential paraphrase pairs.
Igbo Text
Dataset is a parallel dataset containing 10.3M tokens.
Urhobo Text
Dataset is a parallel dataset for the Urhobo language.
Logic2Text
Dataset contains 5,600 tables and 10,753 descriptions involving common logic types paired with the underlying logical forms.
OneStopQA
Dataset comprises 30 articles from the Guardian in 3 parallel text difficulty versions and contains 1,458 paragraph-question pairs with multiple choice questions, along with manual span markings for both correct and incorrect answers.
Audio Visual Scene-Aware Dialog (AVSD)
Dataset consists of text-based human conversations about short videos from the Charades dataset.
Will-They-Won't-They (WT-WT)
Dataset of English tweets targeted at stance detection for the rumor verification task.
SciREX
Dataset is fully annotated with entities, their mentions, their coreferences, and their document-level relations.
GoEmotions
Dataset contains 58K carefully curated Reddit comments labeled for 27 emotion categories: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, & surprise.
Post-Modifier Dataset (PoMo)
Dataset for developing post-modifier generation systems. It's a collection of sentences that contain entity post-modifiers, along with a collection of facts about the entities obtained from Wikidata.
DoQA
Dataset contains domain-specific FAQs accessed via conversational QA: 2,437 information-seeking question/answer dialogues (10,917 questions in total) on three different domains: cooking, travel and movies.
Personal Events in Dialogue Corpus
Dataset is a corpus containing annotated dialogue transcripts from fourteen episodes of the podcast This American Life. It contains 1,038 utterances, made up of 16,962 tokens, of which 3,664 represent events.
Quda
Dataset contains 14,035 diverse user queries annotated with 10 low-level analytic tasks that assist in the deployment of state-of-the-art machine/deep learning techniques for parsing complex human language.
DramaQA
Dataset contains 16,191 question answer pairs from 23,928 various length video clips, with each question answer pair belonging to one of four difficulty levels.
Statutory Reasoning Assessment (SARA)
Dataset contains a set of rules extracted from the statutes of the US Internal Revenue Code (IRC), together with a set of natural language questions which may only be answered correctly by referring to the rules.
InfoTabs
Dataset contains human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes.
emrQA
Dataset contains 1M question-logical form pairs and 400,000+ question-answer-evidence pairs on electronic medical records. In total, there are 2,495 clinical notes.
Credbank
Dataset comprises more than 60M tweets grouped into 1,049 real-world events, each annotated by 30 human annotators.
BuzzFace
Dataset focused on news stories (annotated for veracity) posted to Facebook during September 2016, consisting of nearly 1.7 million Facebook comments discussing the news content, Facebook plugin comments, Disqus plugin comments, and the associated webpage content of the news articles.
Some Like it Hoax
Dataset contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific).
Focused Open Biology Information Extraction (FOBIE)
Dataset contains 1,500 manually-annotated sentences that express domain-independent relations between central concepts in a scientific biology text, such as trade-offs and correlations.
WebNLG (Enriched)
Dataset consists of 25,298 (data,text) pairs and 9,674 distinct data units. The data units are sets of RDF triples extracted from DBPedia and the texts are sequences of one or more sentences verbalising these data units.
FreebaseQA
Dataset contains 28,348 unique questions for open domain QA over the Freebase knowledge graph.
Deft
Dataset contains annotated content from two different data sources: 1) 2,443 sentences from various 2017 SEC contract filings from the publicly available US Securities and Exchange Commission EDGAR (SEC) database, and 2) 21,303 sentences from open source textbooks including topics in biology, history, physics, psychology, economics, sociology, and government.
Clash of Clans
Dataset contains 50K user comments, both from the iTunes App Store and Google Play. The dataset spans from Oct 18, 2018 to Feb 1, 2019.
IRC Disentanglement
Dataset contains 77,563 messages of internet relay chat (IRC). Almost all are from the Ubuntu IRC Logs.
Action Learning From Realistic Environments and Directives (ALFRED)
Dataset contains 8k+ expert demonstrations with 3 or more language annotations each, comprising 25,000 language directives. A trajectory consists of a sequence of expert actions, the corresponding image observations, and language annotations describing segments of the trajectory.
Visual Storytelling Dataset (VIST)
Dataset contains 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. VIST is previously known as "SIND", the Sequential Image Narrative Dataset (SIND).
All the News 2.0
Dataset contains 2.7 million articles from 26 different publications from January 2016 to April 1, 2020.
Datasets Knowledge Embedding
Several datasets containing edges and nodes for knowledge base building.
Frames
Dataset contains 1,369 human-human dialogues with an average of 15 turns per dialogue. This corpus contains goal-oriented dialogues between users who are given some constraints to book a trip and assistants who search a database to find appropriate trips.
SemEval-2014 Task 3
Dataset is used for cross-level semantic similarity which measures the degree to which the meaning of a larger linguistic item, such as a paragraph, is captured by a smaller item, such as a sentence.
SemEval-2019 Task 6 
Dataset containing tweets labeled as either offensive or not offensive (Sub-task A), with offensive tweets further classified into categories (Sub-tasks B and C).
WNUT 2017
Dataset containing tweets, Reddit comments, YouTube comments, and Stack Exchange posts annotated with 6 entity types: Person, Location, Corporation, Consumer good, Creative work, and Group.
MPQA Opinion Corpus
Dataset contains news articles and other text documents manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
Ohsumed Dataset
Dataset containing references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991).
SelQA
Dataset provides crowdsourced annotation for two selection-based question answering tasks: answer sentence selection and answer triggering. The dataset comprises about 8K factoid questions for the top-10 most prevalent topics among Wikipedia articles.
PubMed 200k RCT Dataset
Dataset is based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences.
Textual Visual Semantic Dataset
A dataset for detecting and recognizing text appearing in images (e.g. signboards, traffic signals or brands on clothing or objects). Around 82,000 images.
EventQA
A dataset for answering Event-Centric questions over Knowledge Graphs (KGs). It contains 1,000 semantic queries and the corresponding verbalisations.
Sequential Question Answering (SQA)
Dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has 6,066 sequences with 17,553 questions in total.
WikiTableQuestions
Dataset is for the task of question answering on semi-structured HTML tables.
DocRed
Dataset was constructed from Wikipedia and Wikidata. It annotates both named entities and relations.
Complex Sequential Question Answering (CSQA)
Dataset contains around 200K dialogs with a total of 1.6M turns. Further, unlike existing large scale QA datasets which contain simple questions that can be answered from a single tuple, the questions in the dialogs require a larger subgraph of the KG.
Linked WikiText-2
Dataset contains over 2 million tokens from Wikipedia articles, along with annotations linking mentions to their corresponding entities and relations in Wikidata.
OpenDialKG
Dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic. Each dialog turn is paired with its corresponding "KG paths" that weave together the KG entities and relations that are mentioned in the dialog.
BuGL
Dataset consists of 54 GitHub projects of four different programming languages namely C, C++, Java and Python with around 10,187 issues.
HybridQA
Dataset contains over 70K question-answer pairs based on 13,000 tables; each table is on average linked to 44 passages.
PoKi
Dataset is a corpus of 61,330 poems written by children from grades 1 to 12.
MuTual
Retrieval-based dataset for multi-turn dialogue reasoning, which is modified from Chinese high school English listening comprehension test data.
ToTTo
Dataset is used for the controlled generation of descriptions of tabular data comprising over 100,000 examples. Each example is an aligned pair of a highlighted table and the description of the highlighted content.
VIdeO-and-Language INference (VIOLIN)
Dataset contains 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video (YouTube and TV shows). Inference descriptions of video content were annotated. Inferences are used to measure entailment vs video clip.
NELA-GT-2019
Dataset contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Included are source-level ground truth labels from 7 different assessment sites.
ReClor
Dataset contains logical reasoning questions of standardized graduate admission examinations.
Compositional Freebase Questions (CFQ)
Dataset contains questions and answers that also provides for each question a corresponding SPARQL query against the Freebase knowledge base.
MoviE Text Audio QA (MetaQA)
Dataset contains more than 400K questions for both single and multi-hop reasoning, and provides more realistic text and audio versions. MetaQA serves as a comprehensive extension of WikiMovies.
WebQuestions
Dataset contains 6,642 question/answer pairs. The questions are supposed to be answerable by Freebase, a large knowledge graph. The questions are mostly centered around a single named entity.
MathQA
Dataset contains English multiple-choice math word problems covering multiple math domains, annotated with operation programs corresponding to the word problems in the AQuA dataset.
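The annotations are linear operation programs; a toy interpreter for that style of program (the operations, numbers, and program below are invented for illustration):

# Toy interpreter for a MathQA-style linear operation program.
# "#i" refers to the result of step i; "nj" refers to the j-th number
# in the problem. The program here is invented for illustration.
ops = {"add": lambda a, b: a + b, "divide": lambda a, b: a / b}

def run(program, numbers):
    results = []
    for step in program.split("|"):
        name, args = step.rstrip(")").split("(")
        vals = [results[int(a[1:])] if a.startswith("#") else numbers[int(a[1:])]
                for a in args.split(",")]
        results.append(ops[name](*vals))
    return results[-1]

print(run("add(n0,n1)|divide(#0,n2)", [3, 5, 2]))  # (3 + 5) / 2 = 4.0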
SherLIiC
Dataset contains manually annotated inference rule candidates (InfCands), accompanied by ~960k unlabeled InfCands, and ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09.
DiaBLa
Parallel dataset of spontaneous, written, bilingual dialogues for the evaluation of Machine Translation, annotated for human judgments of translation quality.
Multimodal Sarcasm Detection Dataset (MUStARD)
The dataset, a multimodal video corpus, consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context, which provides additional information on the scenario where the utterance occurs.
Multimodal EmotionLines Dataset (MELD)
Dataset contains the same dialogue instances available in EmotionLines dataset, but it also encompasses audio and visual modality along with text. It has more than 1,400 dialogues and 13,000 utterances from Friends TV series. Each utterance in a dialogue has been labeled by any of these seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise and Fear. It also has sentiment (positive, negative and neutral) annotation for each utterance.
Book Depository Dataset
Dataset contains books from bookdepository.com, not the actual content of the book but a list of metadata like title, description, dimensions, category and others.
COVID-19 Open Research Dataset (CORD-19)
Dataset contains 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.
SemEval-2016 Task 4
Dataset contains 5 subtasks involving the sentiment analysis of tweets.
Multi30k
Dataset of images paired with sentences in English and German. This dataset extends the Flickr30K dataset.
CommonGen
Dataset consists of 30k concept-sets with human-written sentences as references.
Neutralizing Biased Text
A parallel corpus of 180,000+ sentence pairs where one sentence is biased and the other is neutralized. The data were obtained from Wikipedia debiasing edits.
Wikipedia News Corpus
Text from Wikipedia's current events page with dates.
ParCorFull
A parallel corpus annotated for the task of translating coreference across languages.
Taskmaster-2
Dataset consists of 17,289 dialogs in seven domains: restaurants (3276), food ordering (1050), movies (3047), hotels (2355), flights (2481), music (1602), and sports (3478). It consists entirely of spoken two-person dialogs.
WAT 2019 Hindi-English
Dataset consists of multimodal English-to-Hindi translation. The input is an image, a rectangular region in the image, and an English caption; the output is a caption in Hindi.
The TAC Relation Extraction Dataset (TACRED)
A relation extraction dataset containing 106k+ examples covering 42 TAC KBP relation types. Costs $25 for non-members.
Webis-TLDR-17 Corpus
Dataset contains 3 million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization datasets from the social media domain.
Webis-Snippet-20 Corpus
Dataset comprises four abstractive snippet datasets from ClueWeb09, ClueWeb12, and DMOZ descriptions. More than 10 million <webpage, abstractive snippet> pairs / 3.5 million <query, webpage, abstractive snippet> pairs were collected.
WSD English All-Words Fine-Grained Datasets
Five standard all-words Word Sense Disambiguation datasets unified into a single format.
Curation Corpus
Dataset is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves.
How2
Dataset of instructional videos covering a wide variety of topics across video clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. 300 hours have also been translated into Portuguese subtitles.
LibriVoxDeEn
Dataset contains sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences.
Translation-Augmented-LibriSpeech-Corpus (Libri-Trans)
Dataset is an augmentation of LibriSpeech ASR and contains English utterances (from audiobooks) automatically aligned with French text. It offers ~236h of speech aligned to translated text.
ArguAna TripAdvisor Corpus
Dataset contains 2,100 hotel reviews balanced with respect to the reviews' sentiment scores. Reviews are segmented into subsentence-level statements that have been manually classified as a fact, a positive opinion, or a negative opinion.
LC-QuAD 2.0
Dataset contains questions and SPARQL queries. LC-QuAD uses DBpedia v04.16 as the target KB.
X-Sum
The XSum dataset consists of 226,711 Wayback-archived BBC articles (2010 to 2017) covering a wide variety of domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts.
CAPES
A parallel corpus of theses and dissertation abstracts in Portuguese and English from CAPES.
Open Images V6
Dataset containing millions of images that have been annotated with image-level labels and object bounding boxes.
Explain Like I'm Five (ELI5)
The dataset contains 270K threads of open-ended questions that require multi-sentence answers. It was extracted from the subreddit "Explain Like I'm Five" (ELI5), in which an online community answers questions with responses that 5-year-olds can comprehend. Facebook scripts allow you to preprocess the data.
Background Knowledge Dialogue Dataset
Dataset containing movie chats wherein each response is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments and reviews about the movie.
Academic
Questions about the Microsoft Academic Search (MAS) database, derived by enumerating every logical query that could be expressed using the search page of the MAS website and writing sentences to match them.
Advising
Dataset contains questions regarding course information at the University of Michigan, but with fictional student records.
ATIS
Dataset is a collection of utterances to a flight booking system, accompanied by a relational database and SQL queries to answer the questions.
Break
Dataset contains 83,978 examples sampled from 10 question answering datasets over text, images and databases. Dataset used to obtain the Question Decomposition Meaning Representation (QDMR) for questions.
Coarse Discourse
Dataset contains discourse annotations and relations on threads from Reddit during 2016. Requires merging using Reddit API.
Complex Factoid Question Answering with Paraphrase Clusters (ComQA)
The dataset contains questions with various challenging phenomena such as the need for temporal reasoning, comparison (e.g., comparatives, superlatives, ordinals), compositionality (multiple, possibly nested, subquestions with multiple entities), and unanswerable questions.
GAP Coreference Dataset
Dataset contains 8,908 gender-balanced coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia.
GeoQuery
Dataset contains utterances issued to a database of US geographical facts.
PG-19
Dataset contains a set of books extracted from the Project Gutenberg books library that were published before 1919. It also contains metadata of book titles and publication dates.
Restaurants
Dataset contains user questions about restaurants, their food types, and locations.
Scholar
User questions about academic publications, with automatically generated SQL that was checked by asking the user if the output was correct.
Trec CAR Dataset
Dataset contains topics, outlines, and paragraphs that are extracted from English Wikipedia (2016 XML dump). Wikipedia articles are split into the outline of sections and the contained paragraphs.
Wikipedia
The 2016-12-21 dump of English Wikipedia.
WikiSplit
Dataset contains 1 million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.
WikiSQL
A large collection of automatically generated questions about individual tables from Wikipedia.
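Each example pairs a question with a table and a simple SQL query; an illustrative record (field names are approximate; the actual release encodes the SQL as a structured object rather than a string):

# Illustrative WikiSQL-style record; the real release stores the SQL as a
# structured object (column index, aggregation, conditions), not a string.
example = {
    "question": "How many players are from Canada?",
    "table": {"header": ["Player", "Country"], "rows": [["A. Smith", "Canada"]]},
    "sql": "SELECT COUNT(Player) FROM table WHERE Country = 'Canada'",
}
print(example["question"], "->", example["sql"])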
AG News
Dataset contains more than 1 million news articles for topic classification. The 4 classes are: World, Sports, Business, and Sci/Tech.
Conference on Computational Natural Language Learning (CoNLL 2003)
Dataset contains news articles whose text are segmented in 4 columns: the first item is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag.
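A minimal sketch of reading that four-column layout; the two sample rows below follow the published format, with tag values shown for illustration:

# Parse CoNLL-2003-style rows: word, POS tag, syntactic chunk tag, NER tag.
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O"""

for row in sample.splitlines():
    word, pos, chunk, ner = row.split()
    print(f"{word}: pos={pos} chunk={chunk} ner={ner}")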
Excitement Datasets
Datasets contain negative feedback from customers where they state reasons for dissatisfaction with a given company. The datasets are available in English and Italian.
Groningen Meaning Bank
Dataset contains texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and discourse representation structures compatible with first-order logic.
Kensho Derived Wikimedia Dataset (KDWD)
Dataset contains two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base.
Language Modeling Broadened to Account for Discourse Aspects (LAMBADA)
Dataset contains narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word.
Large Movie Review Dataset - Imdb
Dataset contains 25,000 highly polar movie reviews for training, and 25,000 for testing.
LitBank
Dataset contains 100 works of English-language fiction. It currently contains annotations for entities, events and entity coreference in a sample of ~2,000 words from each of those texts, totaling 210,532 tokens.
QASC
QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences.
Quoref
Dataset which tests the coreferential reasoning capability of reading comprehension systems. In this span-selection benchmark containing 24K questions over 4.7K paragraphs from Wikipedia, a system must resolve hard coreferences before selecting the appropriate span(s) in the paragraphs for answering questions.
SemEval-2019 Task 9 - Subtask A
Suggestion Mining from Online Reviews and Forums: Dataset contains corpora of unstructured text intended to be mined for suggestions.
SemEval-2019 Task 9 - Subtask B
Suggestion Mining from Hotel Reviews: Dataset contains corpora of unstructured text intended to be mined for suggestions.
Sentences Involving Compositional Knowledge (SICK)
Dataset contains sentence pairs, generated from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description.
Wikidata NE dataset
Dataset has 2 parts: the Named Entity files and the link files. The Named Entity files include the most important information about the entities, whereas the link files contain the links and ids in other databases.
WikiText-103 & 2
Dataset contains word and character level tokens extracted from Wikipedia.
A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (CLEVR & CoGenT)
Visual question answering dataset contains 100,000 images and 999,968 questions.
Abductive Natural Language Inference (aNLI)
Dataset is a binary-classification task; the goal is to pick the most plausible explanatory hypothesis given two observations from narrative contexts. It contains 20k commonsense narrative contexts and 200k explanations.
Common Objects in Context (COCO)
COCO is a large-scale object detection, segmentation, and captioning dataset. Dataset contains 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, and 5 captions per image.
Cornell Natural Language for Visual Reasoning (NLVR and NLVR2)
Dataset contains two language grounding datasets containing natural language sentences grounded in images. The task is to determine whether a sentence is true about a visual input.
Dialogue Natural Language Inference (NLI)
Dataset used to improve the consistency of a dialogue model. It consists of sentence pairs labeled as entailment (E), neutral (N), or contradiction (C).
EmoBank
Dataset is a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme.
EmpatheticDialogues
Dataset of 25k conversations grounded in emotional situations.
Fact-based Visual Question Answering (FVQA)
Dataset contains image-question-answer triples.
HellaSwag
Dataset for studying grounded commonsense inference. It consists of 70k multiple-choice questions about grounded situations: each question comes from one of two domains (ActivityNet or WikiHow) with four answer choices about what might happen next in the scene.
InsuranceQA
Dataset contains questions and answers collected from the website Insurance Library. It consists of questions from real world users, the answers with high quality were composed by professionals with deep domain knowledge. There are 16,889 questions in total.
Irony Sarcasm Analysis Corpus
Dataset contains tweets in 4 subgroups: irony, sarcasm, regular and figurative. Requires using Twitter API in order to obtain tweets.
OneCommon
Dataset contains 6,760 dialogues.
Physical IQA
Dataset is used for commonsense QA benchmark for naive physics reasoning focusing on how we interact with everyday objects in everyday situations. The dataset includes 20,000 QA pairs that are either multiple-choice or true/false questions.
QA-SRL Bank
Dataset contains question-answer pairs for 64,000 sentences. Dataset is used to train models for semantic role labeling.
QA-ZRE
Dataset contains question-answer pairs, with each instance containing a relation, a question, a sentence, and an answer set.
ReVerb45k, Base and Ambiguous
Three datasets containing 91K triples in total.
Simplified Versions of the CommAI Navigation tasks (SCAN)
Dataset used for studying compositional learning and zero-shot generalization. SCAN consists of a set of commands and their corresponding action sequences.
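For instance, the command "jump twice" maps to the action sequence "I_JUMP I_JUMP". A tiny interpreter covering just that fragment of the grammar (the full SCAN grammar adds directions, conjunctions, and more modifiers):

# Tiny fragment of the SCAN command grammar: primitive verbs plus "twice"/"thrice".
PRIMITIVES = {"jump": "I_JUMP", "walk": "I_WALK", "run": "I_RUN", "look": "I_LOOK"}
REPEATS = {"twice": 2, "thrice": 3}

def interpret(command):
    tokens = command.split()
    actions = [PRIMITIVES[tokens[0]]]
    if len(tokens) > 1:
        actions *= REPEATS[tokens[1]]
    return " ".join(actions)

print(interpret("jump twice"))  # I_JUMP I_JUMP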
Social IQA
Dataset used for a question-answering benchmark testing social commonsense intelligence.
Twitter Chat Corpus
Dataset contains Twitter question-answer pairs.
VisDial
Dataset contains images from the COCO training set, along with dialogues. Meant for training models to answer questions about images during a conversation. Contains 1.2M dialog question-answers.
WinoGrande
Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning.
Affective Text
Classification of emotions in 250 news headlines. Categories: anger, disgust, fear, joy, sadness, surprise.
Classify Emotional Relationships of Fictional Characters
Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters.
DailyDialog
A manually labelled conversations dataset. Categories: no emotion, anger, disgust, fear, happiness, sadness, surprise.
Dataset for Intent Classification and Out-of-Scope Prediction
Dataset is a benchmark for evaluating intent classification systems for dialog systems / chatbots in the presence of out-of-scope queries.
DiscoFuse
Dataset contains examples for training sentence fusion models. Sentence fusion is the task of joining several independent sentences into a single coherent text. The data has been collected from Wikipedia and from Sports articles.
Emotion-Stimulus
Dataset annotated with both the emotion and the stimulus using FrameNet's emotions-directed frame. 820 sentences are annotated with both cause and emotion, and 1,594 sentences are marked with their emotion tag. Categories: happiness, sadness, anger, fear, surprise, disgust and shame.
Event-focused Emotion Corpora for German and English
German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources.
Event2Mind
Dataset contains 25,000 events and free-form descriptions of their intents and reactions.
IIT Bombay English-Hindi Corpus
Dataset contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources.
Paraphrase Adversaries from Word Scrambling (PAWS)
Dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word-order information for the problem of paraphrase identification.
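A hypothetical pair in the PAWS spirit: swapping words preserves almost all lexical overlap while flipping the meaning, so bag-of-words models are easily fooled:

```python
# High lexical overlap, different meaning; the label scheme
# (0 = not a paraphrase, 1 = paraphrase) is an assumption.
pair = ("Flights from New York to Florida.",
        "Flights from Florida to New York.")
label = 0
```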
Relation Extraction Corpus
A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of "place of birth" and 40,000 examples of "attended or graduated from an institution."
Soccer Dialogues
Dataset contains soccer dialogues over a knowledge graph.
Social Media Mining for Health (SMM4H)
Dataset contains medication-related text classification and concept normalization tasks based on Twitter posts.
Switchboard Dialogue Act Corpus (SwDA)
A subset of the Switchboard-1 corpus consisting of 1,155 conversations annotated with 42 dialogue act tags.
The Emotion in Text
Dataset of tweets labelled with emotion. Categories: empty, sadness, enthusiasm, neutral, worry, love, fun, hate, happiness, relief, boredom, surprise, anger.
A Conversational Question Answering Challenge (CoQA)
Dataset for measuring the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation.
A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (DROP)
Dataset is used to resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
ABC Australia News Corpus
Entire news corpus of ABC Australia from 2003 to 2019.
Activitynet-QA
Dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides a benchmark for testing the performance of VideoQA models on long-term spatio-temporal reasoning.
AI2 Reasoning Challenge (ARC)
Dataset contains 7,787 genuine grade-school level, multiple-choice science questions.
AI2 Science Questions Mercury
Dataset consists of questions used in student assessments across elementary and middle school grade levels. Includes questions both with and without diagrams.
AI2 Science Questions v2.1
Dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels. Each question is 4-way multiple choice format and may or may not include a diagram element.
Amazon Fine Food Reviews
Dataset consists of reviews of fine foods from Amazon.
Amazon Reviews
US product reviews from Amazon.
An Open Information Extraction Corpus (OPIEC)
OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia containing more than 341M triples.
AQuA
Dataset containing algebraic word problems with rationales for their answers.
Aristo Tuple KB
Dataset contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, and guided by domain vocabulary constraints.
arXiv Bulk Data
A collection of research papers on arXiv.
ASU Twitter Dataset
Twitter network data, not actual tweets. Shows connections between a large number of users.
Automated Essay Scoring
Dataset contains student-written essays with scores.
Automatic Keyphrase Extraction
Multiple datasets for automatic keyphrase extraction.
bAbI 20 Tasks
Dataset contains a set of contexts, with multiple question-answer pairs available based on the contexts.
bAbI 6 Tasks Dialogue
Dataset contains 6 tasks for testing end-to-end dialog systems in the restaurant domain.
BlogFeedback Dataset
Dataset to predict the number of comments a post will receive based on features of that post.
Blogger Authorship Corpus
Blog post entries of 19,320 people from blogger.com.
BoolQ
Question answering dataset for yes/no questions.
Buzz in Social Media Dataset
Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites.
Car Evaluation Dataset
Car properties and their overall acceptability.
Children's Book Test (CBT)
Dataset contains 'questions' from chapters in the book by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query.
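A minimal sketch of that construction, assuming you already have 21 consecutive sentences from a chapter (the placeholder token and candidate sampling are simplified relative to the real dataset):

```python
# Build one CBT-style cloze example from 21 consecutive sentences:
# the first 20 form the context, and one word removed from the 21st
# becomes the query's blank.
import random

def make_cbt_example(sentences: list[str]) -> dict:
    assert len(sentences) == 21
    context = sentences[:20]
    query_words = sentences[20].split()
    idx = random.randrange(len(query_words))
    answer = query_words[idx]
    query_words[idx] = "_____"  # simplified placeholder
    return {"context": context,
            "query": " ".join(query_words),
            "answer": answer}
```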
Choice of Plausible Alternatives (COPA)
Dataset used for open-domain commonsense causal reasoning.
Clinical Case Reports for Machine Reading Comprehension (CliCR)
Dataset was built from clinical case reports, requiring the reader to answer the query with a medical problem/test/treatment entity.
ClueWeb Corpora
Annotated web pages from the ClueWeb09 and ClueWeb12 corpora.
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI)
Dataset contains more than 23,500 sentence utterance videos from more than 1,000 online YouTube speakers. The dataset is gender balanced. All sentence utterances are randomly chosen from various topics and monologue videos.
CNN / Daily Mail Dataset
Cloze-style reading comprehension dataset created from CNN and Daily Mail news articles.
Coached Conversational Preference Elicitation
Dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.
CommitmentBank
Dataset contains naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional).
COmmonsense Dataset Adversarially-authored by Humans (CODAH)
Commonsense QA in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions.
CommonsenseQA
Dataset contains multiple-choice questions that require different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers.
ComplexWebQuestions
Dataset includes pairs of simple questions and their corresponding SPARQL queries. The queries were taken from WebQuestionsSP and automatically extended into more complex ones that include phenomena such as function composition, conjunctions, superlatives, and comparatives.
Conceptual Captions
Dataset contains ~3.3M images annotated with captions to be used for the task of automatically producing a natural-language description for an image.
Conversational Text-to-SQL Systems (CoSQL)
Dataset consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k dialogues querying 200 complex databases spanning 138 domains. It is the dialogue version of the Spider and SParC tasks.
Cornell Movie-Dialogs Corpus
This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies, for a total of 304,713 utterances.
Cornell Newsroom
Dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017.
Corporate Messaging Corpus
Dataset contains statements classified as information, dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.).
Cosmos QA
Dataset containing thousands of problems that require commonsense-based reading comprehension, formulated as multiple-choice questions.
Dataset for Fill-in-the-Blank Humor
Dataset contains 50 fill-in-the-blank stories similar in style to Mad Libs. The blanks in these stories include the original word and the hint type (e.g. animal, food, noun, adverb).
Dataset for the Machine Comprehension of Text
Stories and associated questions for testing comprehension of text.
Deal or No Deal? End-to-End Learning for Negotiation Dialogues
This dataset consists of 5,808 dialogues, based on 2,236 unique scenarios dealing with negotiations and complex communication.
DEXTER Dataset
The task is to determine, from the given features, which articles are about corporate acquisitions.
DVQA
Dataset containing data visualizations and natural language questions.
Enron Email Dataset
Emails from employees at Enron organized into folders.
Examiner Pseudo-News Corpus
Clickbait, spam, crowd-sourced headlines from 2010 to 2015.
Explanations for Science Questions
Data contains: gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines.
GQA
Question answering on image scene graphs.
Hansards Canadian Parliament
Dataset contains pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament.
Harvard Library
Dataset contains books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
Hate Speech Identification Dataset
Dataset contains lexicons and notebooks with content that is racist, sexist, homophobic, or otherwise offensive.
Historical Newspapers Daily Word Time Series Dataset
Dataset contains daily contents of newspapers published in the US and UK from 1836 to 1922.
Home Depot Product Search Relevance
Dataset contains a number of products and real customer search terms from Home Depot's website.
HotpotQA
Dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
Human-in-the-loop Dialogue Simulator (HITL)
Dataset provides a framework for evaluating a bot's ability to learn to improve its performance in an online setting using feedback from its dialog partner. The dataset contains questions based on the bAbI and WikiMovies datasets, with the addition of feedback from the dialog partner.
Jeopardy Questions and Answers
Dataset contains Jeopardy questions, answers and other data.
Legal Case Reports
Federal Court of Australia cases from 2006 to 2009.
LibriSpeech ASR
Large-scale (1000 hours) corpus of read English speech.
Ling-Spam Dataset
Corpus contains both legitimate and spam emails.
Meta-Learning Wizard-of-Oz (MetaLWOz)
Dataset designed to help develop models capable of predicting user responses in unseen domains. It was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains.
Microsoft Information-Seeking Conversation (MISC) dataset
Dataset contains recordings of information-seeking conversations between human "seekers" and "intermediaries". It includes audio and video signals; transcripts of conversation; affectual and physiological signals; recordings of search and other computer use; and post-task surveys on emotion, success, and effort.
Microsoft Machine Reading COmprehension Dataset (MS MARCO)
Dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies.
Microsoft Research Paraphrase Corpus (MRPC)
Dataset contains pairs of sentences which have been extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.
Microsoft Research Social Media Conversation Corpus
A-B-A triples extracted from Twitter.
MovieLens
Dataset contains 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.
MovieTweetings
Movie rating dataset based on public and well-structured tweets.
MSParS
Dataset for the open domain semantic parsing task.
Multi-Domain Wizard-of-Oz Dataset (MultiWoz)
Dataset of human-human written conversations spanning over multiple domains and topics. The dataset was collected based on the Wizard of Oz experiment on Amazon MTurk.
Multimodal Comprehension of Cooking Recipes (RecipeQA)
Dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images.
MultiNLI Matched/Mismatched
Dataset contains sentence pairs annotated with textual entailment information.
MutualFriends
Task where two agents must discover which friend of theirs is mutual based on the friend's attributes.
NarrativeQA
Dataset contains the list of documents with Wikipedia summaries, links to full stories, and questions and answers.
Natural Questions (NQ)
Dataset contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question.
News Headlines Dataset for Sarcasm Detection
High-quality dataset with sarcastic and non-sarcastic news headlines.
News Headlines Of India
Dataset contains an archive of notable events in India during 2001-2018, recorded by the Times of India.
NewsQA
Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN.
NPS Chat Corpus
Posts from age-specific online chat rooms.
NUS SMS Corpus
SMS messages collected between 2 users, with timing analysis.
NYSK Dataset
English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.
Open Research Corpus
Dataset contains over 39 million published research papers in Computer Science, Neuroscience, and Biomedicine.
OpenBookQA
Dataset modeled after open book exams for assessing human understanding of a subject. It consists of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test), which probe the understanding of a small "book" of 1,326 core science facts and the application of these facts to novel situations.
OpenWebTextCorpus
Dataset contains text from millions of webpages sourced from Reddit URLs, totalling 38GB of text data.
OpinRank Review Dataset
Reviews of cars and hotels from Edmunds.com and TripAdvisor.
Paraphrase and Semantic Similarity in Twitter (PIT)
Dataset focuses on whether tweets have (almost) the same meaning/information or not.
Personalized Dialog
Dataset of dialogs from movie scripts.
Plaintext Jokes
208,000 jokes in this database scraped from three sources.
ProPara Dataset
Dataset is used for comprehension of simple paragraphs describing processes, e.g., photosynthesis. The comprehension task relies on predicting, tracking, and answering questions about how entities change during the process.
QuaRel Dataset
Dataset contains 2,771 story questions about qualitative relationships.
QuaRTz Dataset
Dataset contains 3,864 questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs).
Quasar-S & T
The Quasar-S dataset consists of 37,000 cloze-style queries constructed from definitions of software entity tags on the popular website Stack Overflow. The Quasar-T dataset consists of 43,000 open-domain trivia questions and their answers obtained from various internet sources.
Question Answering in Context (QuAC)
Dataset for modeling, understanding, and participating in information seeking dialog.
Question NLI
Dataset converts the SQuAD dataset into sentence-pair classification by forming a pair between each question and each sentence in the corresponding context.
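A rough sketch of that conversion (the labeling here, marking a sentence positive if it contains the answer span, is a simplification of the actual annotation):

```python
# Turn one SQuAD-style item into question-sentence pairs for
# two-class classification. Sentence splitting is deliberately naive.
def squad_to_qnli(question: str, context: str, answer: str):
    pairs = []
    for sent in context.split(". "):
        label = "entailment" if answer in sent else "not_entailment"
        pairs.append((question, sent, label))
    return pairs
```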
Quora Question Pairs
The task is to determine whether a pair of questions are semantically equivalent.
ReAding Comprehension Dataset From Examinations (RACE)
Dataset was collected from English exams that evaluate students' ability in understanding and reasoning.
Reading Comprehension over Multiple Sentences (MultiRC)
Dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD)
Reading comprehension dataset which requires commonsense reasoning. Contains 120,000+ queries from 70,000+ news articles.
Reading Comprehension with Multiple Hops (Qangaroo)
Reading comprehension datasets focusing on multi-hop (i.e., multi-step) inference. There are two datasets: WikiHop (based on Wikipedia) and MedHop (based on PubMed research papers).
Recognizing Textual Entailment (RTE)
Datasets are combined and converted to two-class classification: entailment and not_entailment.
Reddit All Comments Corpus
All Reddit comments (as of 2017).
Relationship and Entity Extraction Evaluation Dataset (RE3D)
Entity and Relation marked data from various news and government sources.
Reuters-21578 Benchmark Corpus
Dataset is a collection of 10,788 documents from the Reuters financial newswire service, partitioned into a training set with 7,769 documents and a test set with 3,019 documents.
Schema-Guided Dialogue State Tracking (DSTC 8)
Dataset contains 18K dialogues between a virtual assistant and a user.
SciQ Dataset
Dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each.
SciTail Dataset
Dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.
SearchQA
Dataset from Jeopardy! archives which consists of more than 140k question-answer pairs with each pair having 49.6 snippets on average.
Semantic Parsing in Context (SParC)
Dataset consists of 4,298 coherent question sequences (12k+ unique individual questions) annotated with SQL queries. It is the context-dependent/multi-turn version of the Spider task.
Semantic Textual Similarity Benchmark
The task is to predict textual similarity between sentence pairs.
SemEvalCQA
Dataset for community question answering.
Sentiment Labeled Sentences Dataset
Dataset contains 3,000 sentiment labeled sentences.
Sentiment140
Tweet data from 2009 including original text, time stamp, user and sentiment.
Shaping Answers with Rules through Conversation (ShARC)
ShARC is a conversational question answering dataset focusing on question answering from texts containing rules.
Short Answer Scoring
Student-written short-answer responses.
Situations With Adversarial Generations (SWAG)
Dataset consists of 113k multiple choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene.
Skytrax User Reviews Dataset
User reviews of airlines, airports, seats, and lounges from Skytrax.
SMS Spam Collection Dataset
Dataset contains SMS spam messages.
SNAP Social Circles: Twitter Database
Large Twitter network data.
Social-IQ Dataset
Dataset containing videos and natural language questions for visual reasoning.
Spambase Dataset
Dataset contains spam emails.
Spider 1.0
Dataset consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.
SQuAD v2.0
Paragraphs with questions and answers.
Stack Overflow BigQuery Dataset
BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges.
Stanford Natural Language Inference (SNLI) Corpus
Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs.
T-REx
Dataset contains Wikipedia abstracts aligned with Wikidata entities.
TabFact
Dataset contains 16k Wikipedia tables as evidence for 118k human annotated statements to study fact verification with semi-structured evidence.
Taskmaster-1
Dataset contains 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.
Textbook Question Answering
The M3C task builds on the popular Visual Question Answering (VQA) and Machine Comprehension (MC) paradigms by framing question answering as a machine comprehension task, where the context needed to answer questions is provided and composed of both text and images.
TextVQA
TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions.
The Benchmark of Linguistic Minimal Pairs (BLiMP)
BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English.
The Conversational Intelligence Challenge 2 (ConvAI2)
A chit-chat dataset based on the PersonaChat dataset.
The Corpus of Linguistic Acceptability (CoLa)
Dataset used to classify sentences as grammatical or ungrammatical.
The Dialog-based Language Learning Dataset
Dataset was designed to measure how well models can perform at learning as a student given a teacher's textual responses to the student's answer.
The Irish Times IRS
Dataset contains 23 years of events from Ireland.
The Movie Dialog Dataset
Dataset measures how well models can perform at goal and non-goal oriented dialogue centered around the topic of movies (question answering, recommendation and discussion).
The Penn Treebank Project
Naturally occurring text annotated for linguistic structure.
The SimpleQuestions Dataset
Dataset for question answering with human generated questions paired with a corresponding fact, formatted as (subject, relationship, object), that provides the answer but also a complete explanation.
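For example, a hypothetical fact/question pairing in that format (values are ours, not from the dataset):

```python
# A (subject, relationship, object) fact; the object slot answers
# the paired natural-language question.
fact = ("Ada Lovelace", "place_of_birth", "London")
question = "Where was Ada Lovelace born?"
answer = fact[2]
```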
The Stanford Sentiment Treebank (SST)
Sentence sentiment classification of movie reviews.
The Story Cloze Test | ROCStories
Dataset for story understanding that provides systems with four-sentence stories and two possible endings. The systems must then choose the correct ending to the story.
The WikiMovies Dataset
Dataset contains only the QA part of the Movie Dialog dataset, but using three different settings of knowledge: using a traditional knowledge base (KB), using Wikipedia as the source of knowledge, or using IE (information extraction) over Wikipedia.
Topical-Chat
A knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don't have explicitly defined roles.
Total-Text-Dataset
Dataset used for detecting curved text in images.
TrecQA
Dataset is commonly used for evaluating answer selection in question answering.
TriviaQA
Dataset containing over 650K question-answer-evidence triples. It includes 95K QA pairs authored by trivia enthusiasts and independently gathered evidence documents, 6 per question on average.
TupleInf Open IE Dataset
Dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T).
Twenty Newsgroups Dataset
Dataset is a collection of newsgroup documents used for classification tasks.
Twitter US Airline Sentiment
Dataset contains airline-related tweets that were labeled with positive, negative, and neutral sentiment.
Twitter100k
Pairs of images and tweets.
Ubuntu Dialogue Corpus
Dialogues extracted from Ubuntu chat stream on IRC.
Urban Dictionary Dataset
Corpus of words, votes and definitions.
UseNet Corpus
UseNet forum postings.
Visual Commonsense Reasoning (VCR)
Dataset contains 290K multiple-choice questions on 110K images.
Visual QA (VQA)
Dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense to answer.
Voices Obscured in Complex Environmental Settings (VOiCES)
Dataset contains a total of 15 hours (3,903 audio files) in male and female read speech.
Web of Science Dataset
Hierarchical Datasets for Text Classification.
WebQuestions Semantic Parses Dataset
Dataset contains full semantic parses in SPARQL queries for 4,737 questions, and "partial" annotations for the remaining 1,073 questions for which a valid parse could not be formulated or where the question itself is bad or needs a descriptive answer.
Who Did What Dataset
Dataset contains over 200,000 fill-in-the-gap (cloze) multiple choice reading comprehension problems constructed from the LDC English Gigaword newswire corpus.
WikiHow
Dataset contains article and summary pairs extracted and constructed from an online knowledge base written by different human authors.
WikiLinks
Dataset contains 40 million mentions over 3 million entities based on hyperlinks from Wikipedia.
WikiQA Corpus
Dataset contains Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer. 
Winogender Schemas
Dataset with pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
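A sketch of such a minimal pair, built from a hypothetical template (the real schemas are hand-written):

```python
# Sentences differing only in the pronoun; divergent coreference
# predictions across the pair signal gender bias.
template = "The nurse notified the patient that {} shift would be ending soon."
pair = [template.format(p) for p in ("her", "his")]
print(pair[0])
print(pair[1])
```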
Words in Context
Dataset for evaluating contextualized word representations.
Yahoo! Music User Ratings of Musical Artists
Over 10M ratings of artists by Yahoo users. May be used to validate recommender systems or collaborative filtering algorithms.
Yelp Open Dataset
Dataset containing millions of reviews on Yelp. In addition, it contains business data including location data, attributes, and categories.
YouTube Comedy Slam Preference Dataset
User vote data for pairs of videos shown on YouTube; users voted for the video they found funnier.
HeadQA
Dataset is a multichoice testbed of graduate-level questions about medicine, nursing, biology, chemistry, psychology, and pharmacology.
Open Table-and-Text Question Answering (OTT-QA)
Dataset contains open questions which require retrieving tables and text from the web to answer. The dataset is built on the HybridQA dataset.
Taskmaster-3
Dataset consists of 23,757 movie ticketing dialogs. "Movie ticketing" is defined as conversations where the customer's goal is to purchase tickets after deciding on theater, time, movie name, number of tickets, and date, or opt out of the transaction.
STAR
A schema-guided task oriented dialog dataset consisting of 127,833 utterances and knowledge base queries across 5,820 task-oriented dialogs in 13 domains that is especially designed to facilitate task and domain transfer learning in task-oriented dialog.

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using custom language models built with your data and expertise.