List of Spanish Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for languages with fewer resources than English, such as Spanish, can be hard to find, so we collected a list of Spanish NLP datasets for machine learning (ML), a large curated base of training and testing data that lets you start your project right now. The list covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Fine-tune custom models with Spanish datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models.


Found 14 Spanish Datasets

Let’s get started!

CC100-Spanish
This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots in the CC-Net repository. The size of this corpus is 14G.
HeadQA
A multiple-choice testbed of graduate-level questions about medicine, nursing, biology, chemistry, psychology, and pharmacology.
DOGC
A collection of documents from the official journal of the Catalan Government in Catalan and Spanish.
Conference on Computational Natural Language Learning (CoNLL 2002)
The Spanish data is a collection of newswire articles made available by the Spanish EFE News Agency. The Dutch data consists of four editions of the Belgian newspaper "De Morgen" from 2000. The data is in IOB2 format.
Argentinian Spanish [es-ar] Speech Multi-Speaker Dataset
Speech dataset containing about 5,900 transcribed high-quality audio recordings of Argentinian Spanish [es-ar] sentences read by volunteers.
Mercadolibre Data Challenge 2019
This dataset was used in the MercadoLibre Data Challenge and contains multi-language product classification data from MercadoLibre.com.
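
The CoNLL 2002 entry above notes that its data is in IOB2 format. For readers new to that tagging scheme, here is a minimal Python sketch of how IOB2 tags can be grouped into entity spans. The function name iob2_to_spans and the sample sentence are made up for illustration and are not taken from the dataset or from any library; reading the actual CoNLL 2002 files is left out.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (label, start, end) spans; end is exclusive.

    In IOB2, "B-X" opens an entity of type X, "I-X" continues it,
    and "O" marks a token outside any entity.
    """
    spans: List[Tuple[str, int, int]] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        # The open entity ends at "O", at a new "B-", or at an "I-" whose type differs.
        closes = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != label)
        )
        if closes and label is not None:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An "I-" tag with no open entity is invalid IOB2; this sketch skips it.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example sentence, hand-tagged for illustration only.
tokens = ["La", "Agencia", "EFE", "informó", "desde", "Buenos", "Aires", "."]
tags = ["O", "B-ORG", "I-ORG", "O", "O", "B-LOC", "I-LOC", "O"]

for label, start, end in iob2_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Skipping a stray "I-" tag is a design choice made for this sketch; a stricter reader could raise an error instead.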

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using language models customized with your data and expertise.