List of Dutch Datasets for Machine Learning Projects

High-quality datasets are key to good performance in natural language processing (NLP) projects. Datasets for low-resource languages such as Dutch can be hard to find, but a good number are available to start your machine learning (ML) project right away. To help, we collected a list of Dutch NLP datasets for machine learning, a curated base of training and testing data. It covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.



Let’s get started!

CC100-Dutch
This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The size of this corpus is 7.9G.
Conference on Computational Natural Language Learning (CoNLL 2002)
The Spanish data is a collection of newswire articles made available by the Spanish EFE News Agency. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" from 2000. Both are annotated in IOB2 format.
Dutch Book Reviews
The dataset contains book reviews with binary sentiment polarity labels.
Personae Corpus
Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays.
CC100
This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The size of this corpus is 7.9G.
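
The CoNLL 2002 entry above mentions the IOB2 format. In IOB2, the first token of every entity is tagged `B-TYPE`, each following token of the same entity is tagged `I-TYPE`, and tokens outside any entity are tagged `O`. Below is a minimal Python sketch of how such tags could be grouped back into entity spans. The function name `iob2_to_spans` and the example sentence are illustrative assumptions, not part of the dataset or of any library.

```python
from typing import List, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_type, entity_text) pairs.

    Hypothetical helper for illustration only.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def close_entity() -> None:
        # Store the entity collected so far, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # "B-" always starts a new entity, even right after another one.
            close_entity()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # "I-" continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or an "I-" tag with no matching open entity) ends the entity.
            close_entity()

    close_entity()  # flush an entity that runs to the end of the sentence
    return spans


# Made-up example sentence, not taken from the corpus.
tokens = ["De", "Morgen", "verschijnt", "in", "Brussel", "."]
tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC", "O"]
print(iob2_to_spans(tokens, tags))
```

In this sketch, an `I-` tag that does not continue an open entity of the same type is treated as outside any entity. Stricter or more lenient handling is possible, depending on how clean the input is.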
