List of Hindi Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Although there are hard to find low resource language datasets, like Hindi language, there a good list of them to you start your machine learning (ML) project right now. To solve this, we collected a list of Hindi NLP datasets for machine learning, a large curated base for training data and testing data. Covering a wide gamma of NLP use cases, from text classification, part-of-speech (POS), to machine translation.


Custom fine-tune with Hindi datasets

Metatext is a powerful no-code tool for train, tune and integrate custom NLP models
➡️  Try for free


Found 12 Hindi Datasets

Let’s get started!

CC100-Hindi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.5G.
CC100-Hindi Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 129M.
Aesthetics Text Corpus
Dataset consists of novels and short stories written in Hindi language. Novels and stories were scraped from http://hindisamay.com, http://premchand.co.in, a website dedicated to the popular novelist Premchand’s stories, and Bhandarkar Oriental Research Institute’s Digital Library (http://borilib.com). As a preprocessing step, the text was split into sentences and special characters, English tokens and Latin numbers were deleted.
WAT 2019 Hindi-English
Dataset consists of multimodal English-to-Hindi translation. It inputs an image, rectangular region in the image and english caption. It outputs a caption in Hindi.
IIT Bombay English-Hindi Corpus
Dataset contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources.
bAbI 20 Tasks
Dataset cotains a set of contexts, with multiple question-answer pairs available based on the contexts.
CC100-Hindi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.5G.
CC100-Hindi Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 129M.
Aesthetics Text Corpus
Dataset consists of novels and short stories written in Hindi language. Novels and stories were scraped from http://hindisamay.com, http://premchand.co.in, a website dedicated to the popular novelist Premchand’s stories, and Bhandarkar Oriental Research Institute’s Digital Library (http://borilib.com). As a preprocessing step, the text was split into sentences and special characters, English tokens and Latin numbers were deleted.
WAT 2019 Hindi-English
Dataset consists of multimodal English-to-Hindi translation. It inputs an image, rectangular region in the image and english caption. It outputs a caption in Hindi.
IIT Bombay English-Hindi Corpus
Dataset contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources.
bAbI 20 Tasks
Dataset cotains a set of contexts, with multiple question-answer pairs available based on the contexts.

Classify and extract text 10x better and faster 🦾

Metatext helps you to classify and extract information from text and documents with customized language models with your data and expertise.