List of Portuguese Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for low-resource languages such as Portuguese are hard to find, so we collected a list of Portuguese NLP datasets for machine learning: a large curated base of training and testing data that lets you start your machine learning (ML) project right away. The list covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.




Found 40 Portuguese Datasets

Let’s get started!

B5 Corpus
A collection of Facebook posts with information about their Brazilian authors, such as gender, age, personality score (based on the B5 test), education level, political position, religion, and others.
CC100-Portuguese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 13G.
HAREM
Dataset used for Named-Entity Recognition (NER) in Portuguese.
How2
Dataset of instructional videos covering a wide variety of topics (about 2,000 hours of video clips), with word-level time alignments to the ground-truth English subtitles. Of these, 300 hours were translated into Portuguese subtitles.
CAPES
A parallel corpus of theses and dissertation abstracts in Portuguese and English from CAPES.
Portuguese Newswire Corpus
Dataset contains x number of newswire articles collected between 1994 and 2016. Requires preprocessing of the HTML pages, which are available on GitHub via the download link.
Portuguese SQuAD v1.1
Portuguese translation of the SQuAD dataset. The translation was performed using the Google Cloud API.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 13G.
The BrWaC (Brazilian Portuguese Web as Corpus)
This dataset is a large corpus constructed by its authors' lab following the WaCky framework and made public for research purposes.
BlogSet-BR
This dataset is a collection of blog posts crawled from the Blogspot platform, containing texts by Brazilian authors.
Datasets of Neuropsychological Language Tests in Brazilian Portuguese (DNLT-BP)
This dataset contains data collected from participants in clinical or academic studies, all of whom read and signed the Informed Consent Form. The research was evaluated and approved by the Research Ethics Committees of the institutions to which the studies are linked.
Historical Portuguese Corpora (HPC)
This dataset is a sub-project of the Historical Dictionary of Brazilian Portuguese project, which is funded by CNPq, Brazil. The HPC project develops tools and resources for manipulating historical corpora and managing historical dictionaries. The tools and resources were released into the public domain.
Rhetalho
A dataset annotated according to Rhetorical Structure Theory (RST).
Lex2Kids
This dataset contains a lexical representation of the Portuguese most often heard by children. It comprises 36,413 subtitles from films and series in the Family and Animation genres.
ITD - Dataset de Acordãos do STF de 2010 a 2018
The Iudicium Textum Dataset (ITD) contains the texts extracted from the rulings (Acórdãos) of the Supremo Tribunal Federal (STF) from 2010 to 2018. The texts are separated by section, with the votes and reports identified by author (minister). The original text is also kept in full, and the parties involved are, for the most part, identified. The data is organized in a JSON file that can be imported into a MongoDB database. The original PDF files are available alongside the dataset, as are the tools and code used to download, extract, and convert the data that make up the dataset.
PortugueseGLUE
This dataset contains a Portuguese translation of the GLUE benchmark and the Scitail dataset, produced with the OPUS-MT model and Google Cloud Translation.
TweetSentBR
This dataset is intended for sentiment polarity classification. It contains 800k tweets in Portuguese divided into positive, negative, and neutral classes.
B2W-Reviews01
This dataset contains reviews of e-commerce products: about 130k customer reviews extracted from Americanas.com between January and May 2018. It includes annotated data from customer profiles, such as gender, age, and geographic location.
Mercadolibre Data Challenge 2019
This dataset was used in the MercadoLibre Data Challenge and contains multi-language product classification data from MercadoLibre.com.
CorpusTCC
This dataset contains scientific texts from the Brazilian community in the field of computer science.
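
The ITD entry in this list says its data is organized in a JSON file with votes and reports identified by author. As a rough illustration of working with that kind of file, the sketch below parses a JSON array of records and groups one section type by author. The field names `section`, `author`, and `text`, the sample records, and the helper names are all hypothetical; the actual ITD schema is not described here, so check the dataset's own documentation before relying on any of them.

```python
import json
from collections import defaultdict

# Hypothetical sample data. The real ITD field names are not given in this
# list, so "section", "author", and "text" are illustrative assumptions.
SAMPLE_JSON = """
[
  {"section": "relatorio", "author": "Minister A", "text": "report text 1"},
  {"section": "voto", "author": "Minister A", "text": "vote text 1"},
  {"section": "voto", "author": "Minister B", "text": "vote text 2"}
]
"""


def load_records(raw):
    """Parse a JSON array of records into a list of dictionaries."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


def group_by_author(records, section):
    """Collect the texts of one section type, keyed by author."""
    grouped = defaultdict(list)
    for record in records:
        if record.get("section") == section:
            author = record.get("author", "unknown")
            grouped[author].append(record.get("text", ""))
    return dict(grouped)


records = load_records(SAMPLE_JSON)
votes = group_by_author(records, "voto")
print(sorted(votes))
```

To read a real file, the same `load_records` helper could be given the contents of the file instead of the inline sample. A list of dictionaries like this is also the shape that a MongoDB driver typically accepts for bulk inserts, for example pymongo's `insert_many`.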
