List of Indonesian Datasets for Machine Learning Projects

High-quality datasets are key to good performance in natural language processing (NLP) projects. Datasets for low-resource languages such as Indonesian are hard to find, but enough exist for you to start a machine learning (ML) project right away. To help, we collected a list of Indonesian NLP datasets for machine learning, a large curated base of training and testing data. The list covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Custom fine-tune with Indonesian datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models.


Found 28 Indonesian Datasets

Let’s get started!

CASA (IndoNLU)
An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms. The task is defined as multi-label classification, where each label represents the sentiment for a single aspect and takes one of three values: positive, negative, or neutral.
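As a rough illustration of the multi-label setup described above, each review can be turned into one class index per aspect. This is a minimal sketch, not the dataset's actual schema: the aspect names in `ASPECTS`, the ordering of `SENTIMENTS`, and the choice to treat an unmentioned aspect as neutral are all assumptions made for the example.

```python
# Hypothetical aspect names, for illustration only; the real CASA aspect
# inventory is not listed in this article.
ASPECTS = ["fuel", "machine", "price"]

# The three sentiment values named in the description; the index order is an assumption.
SENTIMENTS = ["negative", "neutral", "positive"]


def encode_labels(review_labels):
    """Map an {aspect: sentiment} dict to one class index per aspect.

    Aspects missing from the dict default to "neutral" (an assumption).
    """
    return [
        SENTIMENTS.index(review_labels.get(aspect, "neutral"))
        for aspect in ASPECTS
    ]


# A made-up annotation for a single review.
example = {"fuel": "positive", "price": "negative"}
encoded = encode_labels(example)
print(encoded)  # [2, 1, 0]
```

A model for this kind of task would typically have one three-way output per aspect, so the target for each review is a fixed-length vector like the one printed here.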
HoASA (IndoNLU)
An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel aggregator platform, AiryRooms. The dataset covers ten different aspects of hotel quality. There are four possible sentiment classes for each sentiment label: positive, negative, neutral, and positive-negative.
The Wiki Revision Edits Textual Entailment (WReTE) (IndoNLU)
Dataset consists of 450 sentence pairs constructed from Wikipedia revision history. It contains pairs of sentences and binary semantic relations between the pairs. The data are labeled as entailed when the meaning of the second sentence can be derived from the first one, and not entailed otherwise.
POSP (IndoNLU)
Dataset is collected from Indonesian news websites. The dataset consists of around 8,000 sentences with 26 POS tags.
EmoT (IndoNLU)
Dataset used for emotion classification of tweets with 5 categories: anger, fear, happiness, love and sadness.
SmSA (IndoNLU)
Dataset is a collection of comments and reviews in Indonesian obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists to construct this dataset. There are three possible sentiments: positive, negative, and neutral.
BaPOS (IndoNLU)
Dataset contains about 1,000 sentences, collected from the PAN Localization Project. In this dataset, each word is tagged by one of 23 POS tag classes.
TermA (IndoNLU)
Dataset consists of thousands of hotel reviews, each of which contains span labels for the aspect and sentiment words that represent the reviewer's opinion on the corresponding aspect. The labels use the Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment.
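To make the IOB representation concrete, the sketch below groups token-level tags back into labeled spans. The tag strings `B-ASPECT`, `I-ASPECT`, `B-SENTIMENT`, and `I-SENTIMENT` are assumed for illustration and may not match the names used in the dataset files; the example sentence is made up.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (label, text) spans.

    A "B-X" tag opens a span of type X, a following "I-X" extends it,
    and "O" (or an "I-" tag that does not match the open span) closes it.
    """
    spans = []
    current_label, current_tokens = None, []

    def close_span():
        # Store the span collected so far, if there is one.
        if current_tokens:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_span()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            close_span()
            current_label, current_tokens = None, []
    close_span()
    return spans


# Made-up example: "kamar mandi sangat bersih" (the bathroom is very clean).
tokens = ["kamar", "mandi", "sangat", "bersih"]
tags = ["B-ASPECT", "I-ASPECT", "B-SENTIMENT", "I-SENTIMENT"]
print(iob_to_spans(tokens, tags))
# [('ASPECT', 'kamar mandi'), ('SENTIMENT', 'sangat bersih')]
```

The same decoding applies to any IOB-formatted sequence labeling data, with only the tag names changing.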
KEPS (IndoNLU)
Dataset consists of text from Twitter discussing banking products and services. A phrase containing important information is considered a keyphrase. Text may contain one or more keyphrases since important phrases can be located at different positions. The dataset follows the IOB chunking format, which represents the position of the keyphrase.
NERGrit (IndoNLU)
Dataset consists of three kinds of named entity tags, PERSON (name of person), PLACE (name of location), and ORGANIZATION (name of organization).
NERP (IndoNLU)
Dataset contains texts collected from several Indonesian news websites. There are five labels available in this dataset: PER (name of person), LOC (name of location), IND (name of product or brand), EVT (name of the event), and FNB (name of food and beverage).
FacQA (IndoNLU)
The task in this dataset is to find the answer to a question in a short passage taken from a news article (Purwarianti et al., 2007). Each row in the FacQA dataset consists of a question, a short passage, and a label phrase, which can be found inside the corresponding short passage. There are six categories of questions: date, location, name, organization, person, and quantitative.
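Because the label phrase occurs inside the passage, one simple way to prepare such rows for a sequence labeling model is to locate the phrase and mark its tokens. The sketch below assumes whitespace tokenization and plain `B`/`I`/`O` labels, neither of which is confirmed by the description above; the passage and answer are made up.

```python
def find_answer_span(passage_tokens, answer_tokens):
    """Return (start, end) token indices of the first exact match, or None."""
    n, m = len(passage_tokens), len(answer_tokens)
    if m == 0:
        return None
    for start in range(n - m + 1):
        if passage_tokens[start:start + m] == answer_tokens:
            return start, start + m
    return None


def label_passage(passage_tokens, answer_tokens):
    """Mark the answer phrase with B/I labels and every other token with O."""
    labels = ["O"] * len(passage_tokens)
    span = find_answer_span(passage_tokens, answer_tokens)
    if span is not None:
        start, end = span
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


# Made-up row: the question would ask where the conference was held.
passage = "Konferensi itu digelar di Jakarta pada 2005".split()
answer = "Jakarta".split()
print(label_passage(passage, answer))
# ['O', 'O', 'O', 'O', 'B', 'O', 'O']
```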
IndoSum
A publicly available dataset for Indonesian text summarization, compiled from online news articles.
CC100-Indonesian
This dataset is one of the 100 corpora of monolingual data that were processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The size of this corpus is 36 GB.
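A corpus of this size is usually read as a stream rather than loaded into memory at once. The sketch below assumes the text is laid out with one paragraph per line and a blank line between documents; that layout is an assumption to verify against the downloaded file, and `iter_documents` is a hypothetical helper name.

```python
import io


def iter_documents(lines):
    """Yield documents as lists of paragraphs from an iterable of text lines.

    Assumes one paragraph per line and a blank line between documents.
    """
    paragraphs = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            paragraphs.append(line)
        elif paragraphs:
            # A blank line ends the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# A tiny in-memory stand-in for the real file; in practice, pass an open
# file handle so that only one document is held in memory at a time.
sample = io.StringIO("Paragraf satu.\nParagraf dua.\n\nDokumen kedua.\n")
docs = list(iter_documents(sample))
print(docs)
# [['Paragraf satu.', 'Paragraf dua.'], ['Dokumen kedua.']]
```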

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using language models customized with your data and expertise.