List of Finnish Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Although there are hard to find low resource language datasets, like Finnish language, there a good list of them to you start your machine learning (ML) project right now. To solve this, we collected a list of Finnish NLP datasets for machine learning, a large curated base for training data and testing data. Covering a wide gamma of NLP use cases, from text classification, part-of-speech (POS), to machine translation.


Custom fine-tune with Finnish datasets

Metatext is a powerful no-code tool for train, tune and integrate custom NLP models
➡️  Try for free


Found 14 Finnish Datasets

Let’s get started!

FI News Corpus
Dataset is a collection of news headlines and short summaries of text, organized by date. The news articles were published between 2012-2020.
CC100-Finnish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.
FinChat
Dataset contains conversations with message timestamps, sender’s id, and metadata information. It contains 86 conversations with 3,630 messages, 22,210 words with the average word length of 5.6, and on the average 14 turns per each conversation.
Finnish News Corpus for Named Entity Recognition
Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date.
Finlex
Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish.
Fiskmö
Dataset is a parallel corpus of Finnish and Swedish Languages.
FinChat
Dataset contains conversations with message timestamps, sender’s id, and metadata information. It contains 86 conversations with 3,630 messages, 22,210 words with the average word length of 5.6, and on the average 14 turns per each conversation.
FI News Corpus
Dataset is a collection of news headlines and short summaries of text, organized by date. The news articles were published between 2012-2020.
CC100-Finnish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.
FinChat
Dataset contains conversations with message timestamps, sender’s id, and metadata information. It contains 86 conversations with 3,630 messages, 22,210 words with the average word length of 5.6, and on the average 14 turns per each conversation.
Finnish News Corpus for Named Entity Recognition
Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date.
Finlex
Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish.
Fiskmö
Dataset is a parallel corpus of Finnish and Swedish Languages.
FinChat
Dataset contains conversations with message timestamps, sender’s id, and metadata information. It contains 86 conversations with 3,630 messages, 22,210 words with the average word length of 5.6, and on the average 14 turns per each conversation.

Classify and extract text 10x better and faster 🦾

Metatext helps you to classify and extract information from text and documents with customized language models with your data and expertise.