List of Polish Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for low-resource languages such as Polish are hard to find, so we collected a list of Polish NLP datasets for machine learning (ML): a large curated base of training and testing data that you can use to start your project right now. It covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Fine-tune custom models with Polish datasets

Metatext is a powerful no-code tool to train, tune and integrate custom NLP models


Found 18 Polish Datasets

Let’s get started!

CC100-Polish
This dataset is one of the 100 corpora of monolingual data that were processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The size of this corpus is 12G.
NKJP-NER
Dataset contains extracted sentences with named entities of exactly one type. The task is to predict the type of the named entity.
Compositional Distributional Semantics Corpus (CDSC | E & R)
Dataset is annotated for semantic relatedness and entailment by 3 human judges experienced in Polish linguistics.
Cyberbullying Detection (CBD)
Dataset contains annotated tweets that identify harmful or non-harmful content.
PolEmo2.0-IN & OUT
Dataset contains online reviews from the medicine and hotel domains. The task is to predict the sentiment of a review.
Did You Know (DYK)
Dataset consists of 4,721 question–answer pairs obtained from the Czy wiesz (Do you know) Wikipedia project.
Polish Summaries Corpus (PSC)
Dataset contains news articles and their summaries.
Polish Parliamentary Corpus (PPC)
Dataset is a collection of linguistically analysed documents from the proceedings of the Polish Parliament (Sejm and Senate). It is based on the Polish Sejm Corpus.
CC100
This dataset is one of the 100 corpora of monolingual data that were processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The size of this corpus is 12G.
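
Several of the datasets above frame their task as text classification (predicting a sentiment label, a harmful/non-harmful flag, or a named-entity type). As a rough illustration of what such a task looks like in code, here is a minimal sketch of a bag-of-words Naive Bayes baseline in plain Python. The training sentences and the `positive`/`negative` labels are hypothetical toy examples written for this sketch, not samples from any listed dataset, and the function names `train_naive_bayes` and `predict` are our own.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy examples, NOT drawn from any dataset listed above.
TRAIN = [
    ("hotel był czysty i wygodny", "positive"),
    ("lekarz był miły i pomocny", "positive"),
    ("pokój był brudny i głośny", "negative"),
    ("wizyta była długa i nieprzyjemna", "negative"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior of the label.
        score = math.log(label_counts[label] / total)
        # Add-one (Laplace) smoothed log likelihood of each known token.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens unseen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
```

Calling `predict(model, "czysty i wygodny hotel")` then returns one of the two toy labels. With a real dataset you would replace `TRAIN` with its training split and measure accuracy on its test split; a baseline like this is only a starting point for comparison against fine-tuned models.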

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using language models customized with your data and expertise.