List of Generation Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. We collected a list of NLP datasets for the Generation task to help you get your machine learning projects started. Below you will find a large curated training base for Generation.

What is the Generation task?

This is a sub-domain of Natural Language Processing (NLP). Text generation produces natural language text from an input, with the goal that the generated text is indistinguishable from text originally written by a human.


Custom fine-tune with Generation datasets

Metatext is a powerful no-code tool for training, tuning, and integrating custom NLP models
➡️  Try for free


Found 17 Generation Datasets

Let’s get started!
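Many of the datasets below are also distributed through the Hugging Face Hub. As a minimal getting-started sketch, assuming you use the Hugging Face datasets library and that the dataset you pick is published on the Hub under the identifier used here (we use "common_gen" for CommonGen as an assumption; check the Hub for the exact name), loading and inspecting it looks like this:

# Minimal sketch: load one of the listed datasets with the Hugging Face datasets library.
# The Hub identifier "common_gen" is an assumption; verify the exact name on the Hub.
from datasets import load_dataset

dataset = load_dataset("common_gen")   # downloads and caches all dataset splits
print(dataset)                         # shows the available splits and their columns
print(dataset["train"][0])             # inspect a single training example

From there, the example records can be fed into whatever training or fine-tuning setup you prefer.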

ENT-DESC
Dataset was extracted from Wikipedia and Wikidata and contains over 110k instances. Each sample is a triplet containing a set of entities, the knowledge explored from a knowledge graph (KG), and the description.
Social Bias Inference Corpus (SBIC) 
Dataset contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
Corpus for Knowledge-Enhanced Language Model Pre-training (KELM)
Dataset consists of ∼18M sentences spanning ∼45M triples with ∼1,500 distinct relations from English Wikidata.
NewSHead
Dataset contains 369,940 English stories with 932,571 unique URLs, of which 359,940 stories are for training, 5,000 for validation, and 5,000 for testing. Each news story contains at least three (and up to five) articles.
ParaPhraser Plus
Dataset contains 7,227 pairs of sentences, which are classified by humans into three classes: 2,582 non-paraphrases, 2,957 near-paraphrases, and 1,688 precise-paraphrases.
Inquisitive
Dataset contains ∼19K questions that are elicited while a person is reading through a document. Compared to existing datasets, INQUISITIVE questions target high-level (semantic and discourse) comprehension of the text.
CodeXGLUE: CONCODE
Dataset is used for the task of generating code from a natural language description.
Tumblr GIF (TGIF)
Dataset contains 100K animated GIFs and 120K sentences describing the visual content of the animated GIFs.
ClarQ
Dataset consists of ∼2M question/post tuples distributed across 173 domains of Stack Exchange.
Groove MIDI Dataset (GMD)
Dataset is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming.
WikiBio
Dataset contains 728,321 biographies from Wikipedia. For each article, it provides the first paragraph and the infobox (both tokenized).
E2E
Dataset contains 50k combinations of a dialogue-act-based meaning representation and, on average, 8.1 references per meaning representation, in the restaurant domain (see the parsing sketch after this list).
PARANMT-50M
Dataset contains more than 50 million English-English sentential paraphrase pairs.
Post-Modifier Dataset (PoMo)
Dataset for developing post-modifier generation systems. It's a collection of sentences that contain entity post-modifiers, along with a collection of facts about the entities obtained from Wikidata.
WebNLG (Enriched)
Dataset consists of 25,298 (data,text) pairs and 9,674 distinct data units. The data units are sets of RDF triples extracted from DBPedia and the texts are sequences of one or more sentences verbalising these data units.
CommonGen
Dataset consists of 30k concept-sets with human-written sentences as references.
Dataset for Fill-in-the-Blank Humor
Dataset contains 50 fill-in-the-blank stories similar in style to Mad Libs. The blanks in these stories include the original word and the hint type (e.g. animal, food, noun, adverb).
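As noted in the E2E entry above, each E2E example pairs a flat slot[value] meaning representation with one or more human-written references. The following is a rough sketch, assuming the standard E2E MR string format (e.g. "name[The Vaults], eatType[pub], priceRange[more than £30]"); parse_e2e_mr is an illustrative helper of ours, not part of any official toolkit.

import re

def parse_e2e_mr(mr: str) -> dict:
    # Turn an E2E-style meaning representation into a {slot: value} dictionary.
    return {slot.strip(): value
            for slot, value in re.findall(r"([^,\[\]]+)\[([^\]]*)\]", mr)}

example_mr = "name[The Vaults], eatType[pub], priceRange[more than £30], near[Café Adriatic]"
print(parse_e2e_mr(example_mr))
# {'name': 'The Vaults', 'eatType': 'pub', 'priceRange': 'more than £30', 'near': 'Café Adriatic'}

A generation model is then trained to map such a meaning representation (parsed or raw) to one of its references.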

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using customized language models built with your data and expertise.