List of German Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for lower-resource languages such as German can be hard to find, but there are enough of them to start a machine learning (ML) project right away. To make them easier to find, we collected a list of German NLP datasets for machine learning, a large curated base of training and testing data. It covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Custom fine-tune with German datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models.


Found 30 German Datasets

Let’s get started!

Argumentation Annotated Student Peer Reviews Corpus (AASPRC)
Dataset contains 1,000 persuasive student peer reviews giving feedback on business models, annotated for their argumentative components and argumentative relations.
CC100-German
This dataset is one of the 100 corpora of monolingual data that were processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The size of this corpus is 18G.
WebNLG (Enriched)
Dataset consists of 25,298 (data, text) pairs and 9,674 distinct data units. The data units are sets of RDF triples extracted from DBpedia, and the texts are sequences of one or more sentences verbalising these data units.
Multi30k
Dataset of images paired with sentences in English and German. This dataset extends the Flickr30K dataset.
ParCorFull
A parallel corpus annotated for the task of translating coreference across languages.
LibriVoxDeEn
Dataset contains sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences.
Named Entity Model for German, Politics (NEMGP)
Dataset contains texts from Wikipedia and WikiNews, manually annotated with named entity information.
Conference on Computational Natural Language Learning (CoNLL 2003)
Dataset contains news articles whose text is arranged in four columns: the first item is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag.
Customer Interaction Data of German Emails and Online Requests
Dataset is used to evaluate the task of automatically categorizing German customer requests. The dataset consists of a set of emails and online requests sent to the support center of a multimedia software company.
GermEval 2014 NER Shared Task
The data was sampled from German Wikipedia and News Corpora as a collection of citations. The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.
Wikidata NE dataset
Dataset has 2 parts: the Named Entity files and the link files. The Named Entity files include the most important information about the entities, whereas the link files contain the links and ids in other databases.
Sentiment Corpus of App Reviews with Fine-grained Annotations in German (SCARE)
Dataset consists of fine-grained annotations for mobile application reviews from the Google Play Store. For each user review, the mentioned application aspects (e.g., the design or the usability) are annotated, as well as the subjective phrases that evaluate these aspects. In addition, the polarity (positive, negative or neutral) of each subjective phrase is recorded, as well as the relationship of an aspect to the main app under discussion. Retrieving the data requires emailing the source for a password.
Event-focused Emotion Corpora for German and English
German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources.
Ten Thousand German News Articles Dataset (10kGNAD)
Dataset consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics.
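
The CoNLL 2003 entry above describes a four-column layout (word, POS tag, chunk tag, named entity tag). As a rough illustration of how such a file could be read, here is a minimal Python sketch. It assumes whitespace-separated columns, blank lines between sentences, and `-DOCSTART-` lines as document markers; the actual files you download may differ (for example, they may carry an extra column), so check them first. The names `Token` and `read_conll`, and the sample sentences and tags, are made up for this example.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One row of the four-column layout described in the list above."""
    word: str
    pos: str
    chunk: str
    ner: str


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group four-column rows into sentences.

    Assumption: blank lines separate sentences and lines starting with
    -DOCSTART- mark document boundaries; both are skipped.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Sentence or document boundary: flush the tokens collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
        current.append(Token(*parts))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed layout (not taken from the real dataset).
SAMPLE = """\
-DOCSTART- -X- -X- O

Berlin NE I-NC I-LOC
ist VAFIN I-VC O
schön ADJD I-AC O
. $. O O

Angela NE I-NC I-PER
Merkel NE I-NC I-PER
spricht VVFIN I-VC O
"""

sentences = read_conll(SAMPLE.splitlines())

for sentence in sentences:
    # Show each word together with its named entity tag.
    print([(token.word, token.ner) for token in sentence])
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to your local copy.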

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using language models customized with your data and expertise.