List of NER Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. We collected a list of NLP datasets for the NER task to get your machine learning projects started. Below you will find a large curated training base for NER.

What is the NER task?

Named Entity Recognition (NER) is the task of locating named entities in unstructured text (text without the use of a markup language) and classifying them into predefined categories such as person, organization, and location, leveraging an understanding of a specific domain (e.g., medicine, finance) and language (e.g., English, Chinese).
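As a minimal illustration of the task's input and output, consider a toy tagger built on a hard-coded gazetteer (the names and labels below are made up for the example; real NER systems use trained models, not lookup tables):

```python
# Toy gazetteer mapping known entity surface forms to labels.
GAZETTEER = {
    "Paris": "LOC",
    "Marie Curie": "PER",
    "Sorbonne": "ORG",
}

def toy_ner(text):
    """Return (surface form, label, start offset) for each known entity,
    sorted by position in the text."""
    entities = []
    for surface, label in GAZETTEER.items():
        start = text.find(surface)
        if start != -1:
            entities.append((surface, label, start))
    return sorted(entities, key=lambda e: e[2])

print(toy_ner("Marie Curie studied at the Sorbonne in Paris."))
# → [('Marie Curie', 'PER', 0), ('Sorbonne', 'ORG', 27), ('Paris', 'LOC', 39)]
```

The datasets below provide exactly this kind of span/label supervision at scale, so a model can learn to recognize entities it has never seen in any list.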


Custom fine-tuning with NER datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models.
➡️  Try for free


Found 43 NER Datasets

Let’s get started!

NERGrit (IndoNLU)
Dataset consists of three kinds of named entity tags, PERSON (name of person), PLACE (name of location), and ORGANIZATION (name of organization).
NERP (IndoNLU)
Dataset contains texts collected from several Indonesian news websites. There are five labels available in this dataset: PER (name of person), LOC (name of location), IND (name of product or brand), EVT (name of event), and FNB (name of food and beverage).
WikiNER
Dataset contains 7,200 manually-labelled Wikipedia articles across nine languages: English, German, French, Polish, Italian, Spanish, Dutch, Portuguese, and Russian.
CoNLL 2003 ++
Similar to the original CoNLL except test set has been corrected for label mistakes. The dataset is split into training, development, and test sets, with 14,041, 3,250, and 3,453 instances respectively.
WNUT 2016
Dataset is annotated with 10 fine-grained NER categories: person, geo-location, company, facility, product, music artist, movie, sports team, TV show, and other. Dataset was extracted from tweets and is structured in CoNLL format.
ENT-DESC
Dataset was extracted from Wikipedia and Wikidata, which contains over 110k instances. Each sample is a triplet, containing a set of entities, the explored knowledge from a KG, and the description.
Social Bias Inference Corpus (SBIC) 
Dataset contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
Corpus for Knowledge-Enhanced Language Model Pre-training (KELM)
Dataset consists of ∼18M sentences spanning ∼45M triples with ∼1,500 distinct relations from English Wikidata.
NewSHead
Dataset contains 369,940 English stories with 932,571 unique URLs, of which 359,940 stories are for training, 5,000 for validation, and 5,000 for testing. Each news story contains at least three (and up to five) articles.
ParaPhraser Plus
Dataset contains 7,227 pairs of sentences, which are classified by humans into three classes: 2,582 non-paraphrases, 2,957 near-paraphrases, and 1,688 precise-paraphrases.
Inquisitive
Dataset contains ∼19K questions elicited while a person reads through a document. Compared to existing datasets, INQUISITIVE questions target higher-level (semantic and discourse) comprehension of text.
CodeXGLUE: CONCODE
Dataset is used for the task of generating code from a natural language description.
BioCreative II Gene Mention Recognition (BC2GM)
Dataset contains data where participants are asked to identify a gene mention in a sentence by giving its start and end characters. The training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations). [registration required for access]
BC5CDR Drug/Chemical (BC5-Chem)
Dataset consists of three separate sets of articles with chemicals and their relations annotated. [registration required for access]
BC5CDR Disease (BC5-Disease)
Dataset consists of three separate sets of articles with chemicals and their relations annotated. [registration required for access]
JNLPBA
The BioNLP / JNLPBA Shared Task 2004 involves the identification and classification of technical terms referring to concepts of interest to biologists in the domain of molecular biology. 
NCBI Disease Corpus
Dataset contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier.
ScienceExamCER
Dataset contains 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms.
Tumblr GIF (TGIF)
Dataset contains 100K animated GIFs and 120K sentences describing visual content of the animated GIFs.
ClarQ
Dataset consists of ∼2M question/post tuples distributed across 173 domains of stackexchange.
Groove MIDI Dataset (GMD)
Dataset is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming.
WikiBio
Dataset contains 728,321 biographies from Wikipedia. For each article, it provides the first paragraph and the infobox (both tokenized).
E2E
Dataset contains 50k combinations of a dialogue-act-based meaning representation with 8.1 natural-language references on average, in the restaurant domain.
KALIMAT Multipurpose Arabic Corpus
Dataset contains 20,291 Arabic articles collected from the Omani newspaper Alwatan, with extractive single-document and multi-document system summaries and named-entity-annotated articles. The data has 6 categories: culture, economy, local news, international news, religion, and sports.
HAREM
Dataset used for Named-Entity Recognition (NER) in Portuguese.
PARANMT-50M
Dataset containing more than 50 million English-English sentential paraphrase pairs.
NKJP-NER
Dataset contains extracted sentences with named entities of exactly one type. The task is to predict the type of the named entity.
Post-Modifier Dataset (PoMo)
Dataset for developing post-modifier generation systems. It's a collection of sentences that contain entity post-modifiers, along with a collection of facts about the entities obtained from Wikidata.
WebNLG (Enriched)
Dataset consists of 25,298 (data,text) pairs and 9,674 distinct data units. The data units are sets of RDF triples extracted from DBPedia and the texts are sequences of one or more sentences verbalising these data units.
WNUT 2017
Dataset contains tweets, Reddit comments, YouTube comments, and StackExchange posts annotated with 6 entity types: person, location, corporation, consumer good, creative work, and group.
BSNLP-2019
Dataset used to classify named entities in web documents in Slavic languages, their lemmatization, and cross-language matching. Dataset covers 4 languages: Bulgarian, Czech, Polish, and Russian.
Finnish News Corpus for Named Entity Recognition
Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date.
WikiAnn
Dataset with NER annotations for PER, ORG and LOC. It has been constructed using the linked entities in Wikipedia pages for 282 different languages.
CommonGen
Dataset consists of 30k concept-sets with human-written sentences as references.
Conference on Computational Natural Language Learning (CoNLL 2002)
The Spanish data is a collection of newswire articles made available by the Spanish EFE News Agency. The Dutch data consists of four editions of the Belgian newspaper "De Morgen" from 2000. IOB2 format.
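In the IOB2 scheme used here, B-X marks the first token of an entity of type X and I-X marks its continuation tokens. A small decoder (a sketch, not tied to any particular dataset loader) turns a tag sequence back into entity spans:

```python
def iob2_to_spans(tokens, tags):
    """Decode parallel token/IOB2-tag lists into (entity text, label) pairs."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always opens a new entity, closing any open one first.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # I- of the same type continues the open entity.
            current.append(token)
        else:
            # 'O' (or an inconsistent I- tag) closes any open entity.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["De", "Morgen", "is", "een", "Belgische", "krant"]
tags = ["B-ORG", "I-ORG", "O", "O", "B-MISC", "O"]
print(iob2_to_spans(tokens, tags))
# → [('De Morgen', 'ORG'), ('Belgische', 'MISC')]
```

The same decoder works for any of the IOB2-formatted corpora in this list once the files are read into token and tag lists.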
Densely Annotated Wikipedia Texts (DAWT)
Dataset contains a total of 13.6M articles across several languages: English, Spanish, Italian, German, French and Arabic. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of entity.
Europeana Newspapers
Named Entity Recognition corpora for Dutch, French, German languages from Europeana Newspapers. Data is encoded in the IOB format.
Named Entity Model for German, Politics (NEMGP)
Dataset contains texts from Wikipedia and WikiNews, manually annotated with named entity information.
The NewsReader MEANTIME Corpus
480 news articles: 120 English Wikinews articles on four topics (Airbus and Boeing; Apple Inc.; the stock market; General Motors, Chrysler and Ford) and their translations into Spanish, Italian, and Dutch. Annotated with entities, events, temporal information, semantic roles, and event/entity coreference.
Conference on Computational Natural Language Learning (CoNLL 2003)
Dataset contains news articles whose text are segmented in 4 columns: the first item is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag.
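The four-column layout described above can be read with a few lines of standard Python. This is a sketch assuming whitespace-separated columns and blank lines between sentences, as in the released CoNLL files:

```python
def parse_conll2003(text):
    """Split CoNLL 2003-style data into sentences of
    (word, POS tag, chunk tag, NER tag) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences

sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
to TO I-PP O
Baghdad NNP I-NP I-LOC
. . O O"""
print(parse_conll2003(sample))
```

Note that the sample sentence is the well-known example from the shared task data, which uses the original IOB1 tagging variant.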
GermEval 2014 NER Shared Task
The data was sampled from German Wikipedia and news corpora as a collection of citations. The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.
LitBank
Dataset contains 100 works of English-language fiction. It currently contains annotations for entities, events and entity coreference in a sample of ~2,000 words from each of those texts, totaling 210,532 tokens.
Dataset for Fill-in-the-Blank Humor
Dataset contains 50 fill-in-the-blank stories similar in style to Mad Libs. The blanks in these stories include the original word and the hint type (e.g. animal, food, noun, adverb).

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using customized language models built with your data and expertise.