List of Korean Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for low-resource languages such as Korean can be hard to find, so we collected a list of Korean NLP datasets for machine learning (ML): a curated base of training and testing data that you can use to start your project right away. The list covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.





Let’s get started!

CC100-Korean
This dataset is one of the 100 corpora of monolingual data processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 14 GB in size.
Intonation-Aided Intention Identification for Korean (3i4K)
Dataset contains a corpus of single text utterances (intents) from conversation, annotated with seven classes.
Korean Hate Speech Dataset
Dataset contains ~9.4K manually labeled entertainment news comments for identifying Korean toxic speech.
Korean Single Speaker Dataset (KSS)
Dataset consists of audio files recorded by a professional voice actress, with aligned text extracted from books.
KorNLI
Dataset used for natural language inference for the Korean language.
KorSTS
Dataset used for semantic textual similarity for the Korean language.
KorQuAD
Dataset containing a total of 100,000+ question-answer pairs.
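
Question answering corpora such as KorQuAD are usually distributed as nested JSON, which often needs flattening into one record per question before training. The sketch below assumes a SQuAD v1-style layout (data > paragraphs > qas > answers); that assumption should be checked against the actual files you download. The sample record is hand-written for illustration and is not real KorQuAD data, and `flatten_qa` is a hypothetical helper name.

```python
import json

# Hand-written sample in an assumed SQuAD v1-style layout (not real KorQuAD data).
SAMPLE = """
{
  "version": "sample",
  "data": [
    {
      "title": "서울",
      "paragraphs": [
        {
          "context": "서울은 대한민국의 수도이다.",
          "qas": [
            {
              "id": "sample-1",
              "question": "대한민국의 수도는 어디인가?",
              "answers": [{"text": "서울", "answer_start": 0}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def flatten_qa(raw):
    """Yield one flat record per question from a SQuAD-style JSON string."""
    doc = json.loads(raw)
    for article in doc["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Use the first listed answer for each question.
                answer = qa["answers"][0]
                start = answer["answer_start"]
                end = start + len(answer["text"])
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "answer": answer["text"],
                    # True when the character offset really points at the answer text.
                    "span_ok": context[start:end] == answer["text"],
                }


records = list(flatten_qa(SAMPLE))
for record in records:
    print(record)
```

The `span_ok` flag is a cheap sanity check: offsets are counted in characters, so a mismatch usually signals an encoding or normalization problem in the Korean text rather than a labeling error.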
