List of Vietnamese Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for low-resource languages such as Vietnamese are hard to find, so we collected a list of Vietnamese NLP datasets for machine learning (ML): a large curated base of training and testing data. The list covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Custom fine-tune with Vietnamese datasets

Metatext is a powerful no-code tool for training, tuning and integrating custom NLP models.


Vietnamese Datasets

Let’s get started!

CC100-Vietnamese
This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 28 GB in size.
Vietnamese Question Answering Dataset (ViQuAD)
The dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages from 174 Vietnamese Wikipedia articles. [requires author permission]
Vietnamese Multiple-choice Machine Reading Comprehension Corpus (ViMMRC)
The dataset contains 2,783 multiple-choice questions and answers based on 417 Vietnamese texts used to teach reading comprehension to 1st to 5th graders. [requires contacting author for corpus]
Vietnamese Students’ Feedback Corpus (UIT-VSFC)
The dataset contains over 16,000 sentences, human-annotated for two tasks: sentiment-based and topic-based classification.
UIT-SPC
The dataset contains 1,565 papers from top NLP/CL conferences such as ACL, CoNLL, EACL, NAACL and EMNLP. The papers were pre-processed to remove unnecessary information (e.g. formulas and tables), then formatted as .xml files that include the paper title, sections, and sub-sections according to each paper's structure. [requires contacting author for corpus]
Vietnamese Social Media Emotion Corpus (UIT-VSMEC)
The dataset contains 6,927 human-annotated sentences with six emotion labels, contributing to emotion recognition research in Vietnamese.
Vietnamese Image Captioning Dataset (UIT-ViIC)
The dataset consists of 19,250 captions for 3,850 images of sports played with balls. [requires contacting author for corpus]
CC100
This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 28 GB in size.
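
A corpus as large as CC100-Vietnamese is usually easier to read as a stream than to load into memory at once. The following is a minimal Python sketch of that idea, not an official loader. It assumes the corpus has been downloaded as plain text in which blank lines separate documents and each non-blank line is one paragraph; the function name `iter_documents` and the sample text are hypothetical and used only for illustration.

```python
import io
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Lazily yield documents as lists of paragraph strings.

    Assumption: blank lines separate documents, and every
    non-blank line is one paragraph of the current document.
    """
    paragraphs: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # Emit the last document if the input does not end with a blank line.
        yield paragraphs


# Small in-memory stand-in for a real corpus file (hypothetical content).
sample = io.StringIO("Xin chào.\nĐây là đoạn hai.\n\nTài liệu thứ hai.\n")
docs = list(iter_documents(sample))
print(len(docs))
```

With a real file, the same function could be given an open file handle, for example `open(path, encoding="utf-8")`, so that only one document is held in memory at a time.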

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using language models customized with your data and expertise.