List of Corpora Datasets for Machine Learning Projects
High-quality datasets are the key to good performance in natural language processing (NLP) projects. We collected a list of NLP datasets for the Corpora task to help you get started on your machine learning projects. Below you will find a large curated training base for Corpora.
What is the Corpora task?
NLP corpora datasets are structured learning data containing texts from various domains.
Custom fine-tune with Corpora datasets
Metatext is a powerful no-code tool for training, tuning and integrating custom NLP models.
Found 550 Corpora Datasets
Let’s get started!
SigmaLaw-ABSA
Dataset contains legal data consisting of 39,155 legal cases, including 22,776 taken from the United States Supreme Court. For the data collection process, about 2,000 sentences were gathered for annotation, and court cases were selected without targeting any specific category. Party-based sentiment polarity values are annotated as negative, positive, or neutral.
FI News Corpus
Dataset is a collection of news headlines and short text summaries, organized by date. The news articles were published between 2012 and 2020.
CC100-Afrikaans
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 305M.
CC100-Amharic
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 133M.
CC100-Arabic
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.4G.
CC100-Assamese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.6M.
CC100-Azerbaijani
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.3G.
CC100-Belarusian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 692M.
CC100-Bulgarian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 9.3G.
CC100-Bengali
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 860M.
CC100-Bengali Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 164M.
CC100-Breton
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 21M.
CC100-Bosnian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 18M.
CC100-Catalan
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.4G.
CC100-Czech
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.4G.
CC100-Welsh
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 179M.
CC100-Danish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 12G.
CC100-German
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 18G.
CC100-Greek
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.4G.
CC100-English
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 82G.
CC100-Esperanto
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 250M.
CC100-Spanish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Estonian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.7G.
CC100-Basque
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 488M.
CC100-Persian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 20G.
CC100-Fulah
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 3.1M.
CC100-Finnish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.
CC100-French
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Frisian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 38M.
CC100-Irish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 108M.
CC100-Scottish Gaelic
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 22M.
CC100-Galician
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 708M.
CC100-Guarani
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.5M.
CC100-Gujarati
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 242M.
CC100-Hausa
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 61M.
CC100-Hebrew
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 6.1G.
CC100-Hindi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.5G.
CC100-Hindi Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 129M.
CC100-Croatian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.7G.
CC100-Haitian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 9.1M.
CC100-Hungarian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.
CC100-Armenian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 776M.
CC100-Indonesian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 36G.
CC100-Igbo
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 6.6M.
CC100-Icelandic
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 779M.
CC100-Italian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.8G.
CC100-Japanese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.
CC100-Javanese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 37M.
CC100-Georgian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.1G.
CC100-Kazakh
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 889M.
CC100-Khmer
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 153M.
CC100-Kannada
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 360M.
CC100-Korean
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Kurdish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 90M.
CC100-Kyrgyz
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 173M.
CC100-Latin
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 609M.
CC100-Ganda
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.3M.
CC100-Limburgish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.2M.
CC100-Lingala
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.3M.
CC100-Lao
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 63M.
CC100-Lithuanian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 3.4G.
CC100-Latvian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.1G.
CC100-Malagasy
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 29M.
CC100-Macedonian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 706M.
CC100-Malayalam
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 831M.
CC100-Mongolian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 397M.
CC100-Marathi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 334M.
CC100-Malay
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.1G.
CC100-Burmese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46M.
CC100-Burmese (Zawgyi)
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 178M.
CC100-Nepali
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 393M.
CC100-Dutch
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.9G.
CC100-Norwegian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 13G.
CC100-Northern Sotho
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.8M.
CC100-Oromo
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 11M.
CC100-Oriya
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 56M.
CC100-Punjabi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 90M.
CC100-Polish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 12G.
CC100-Pashto
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 107M.
CC100-Portuguese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 13G.
CC100-Quechua
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.5M.
CC100-Romansh
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.8M.
CC100-Romanian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 16G.
CC100-Russian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46G.
CC100-Sanskrit
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 44M.
CC100-Sinhala
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 452M.
CC100-Sardinian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 143K.
CC100-Sindhi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 67M.
CC100-Slovak
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 6.1G.
CC100-Slovenian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.8G.
CC100-Somali
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 78M.
CC100-Albanian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.3G.
CC100-Serbian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.5G.
CC100-Swati
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 86K.
CC100-Sundanese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15M.
CC100-Swedish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 21G.
CC100-Swahili
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 332M.
CC100-Tamil
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.3G.
CC100-Tamil Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 68M.
CC100-Telugu
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 536M.
CC100-Telugu Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 79M.
CC100-Thai
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 8.7G.
CC100-Tagalog
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 701M.
CC100-Tswana
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 8.0M.
CC100-Turkish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.4G.
CC100-Uyghur
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46M.
CC100-Ukrainian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Urdu
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 884M.
CC100-Urdu Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 141M.
CC100-Uzbek
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 155M.
CC100-Vietnamese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 28G.
CC100-Wolof
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 3.6M.
CC100-Xhosa
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 25M.
CC100-Yiddish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 51M.
CC100-Yoruba
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.1M.
CC100-Chinese (Simplified)
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Chinese (Traditional)
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.3G.
CC100-Zulu
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.3M.
Elsevier OA CC-BY
Dataset contains 40,091 open access (OA) CC-BY articles from across Elsevier’s journals.
CC Net
Dataset of the Common Crawl corpus that has been cleaned and deduplicated. The pipeline preserves the structure of documents and filters the data based on their distance to Wikipedia.
SFU Opinion and Comments Corpus (SOCC)
Dataset contains 10,339 opinion articles (editorials, columns, and op-eds) together with their 663,173 comments from 303,665 comment threads, from the main Canadian daily in English, The Globe and Mail, from January 2012 to December 2016. In addition, there is an annotated subset measuring toxicity, negation and its scope, and appraisal, containing 1,043 annotated comments in response to 10 different articles covering a variety of subjects: technology, immigration, terrorism, politics, budget, social issues, religion, property, and refugees.
The Semantic Scholar Open Research Corpus (S2ORC)
Dataset contains 136M+ paper nodes with 12.7M+ full text papers and connected by 467M+ citation edges.
ACL Anthology Reference Corpus (ACL ARC)
Dataset contains 10,921 articles from the February 2007 snapshot of the Anthology; text and metadata for the articles were extracted, consisting of BibTeX records derived either from the headers of each paper or from metadata taken from the Anthology website.
UIT-SPC
Dataset contains 1,565 papers from top NLP/CL conferences such as ACL, CoNLL, EACL, NAACL and EMNLP. They are pre-processed by removing unnecessary information (e.g., formulas and tables). They are then formatted as .xml files that include the paper title, sections, and sub-sections according to the paper's structure. [requires contacting author for corpus]
Aesthetics Text Corpus
Dataset consists of novels and short stories written in Hindi. Novels and stories were scraped from http://hindisamay.com, http://premchand.co.in (a website dedicated to the popular novelist Premchand’s stories), and Bhandarkar Oriental Research Institute’s Digital Library (http://borilib.com). As a preprocessing step, the text was split into sentences, and special characters, English tokens and Latin numbers were deleted.
Hippocorpus
Dataset of 6,854 English diary-like short stories about recalled and imagined events.
EBM PICO
Dataset contains ~5,000 medical abstracts describing clinical trials, annotated in detail with respect to characteristics of the underlying trial Populations (e.g., diabetics), Interventions (insulin), Comparators (placebo) and Outcomes (blood glucose levels).
FT Speech
Dataset contains recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers.
ArxivPapers
Dataset is a corpus of over 100,000 scientific papers related to machine learning.
CodeSearchNet Corpus
Dataset contains functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub.
NIPS Papers
Dataset contains the title, authors, abstracts, and extracted text for all NIPS papers from 1987 to 2016.
Ljspeech
Dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
VoxForge
Dataset consisting of speech audio clips submitted by the community involving several different languages. Dataset is constantly updated.
Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong
Dataset contains aligned sentence pairs from bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislations and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others.
COVID-19 Twitter Chatter Dataset
Dataset contains over 152 million tweets, growing daily, related to COVID-19 chatter generated from January 1st, 2020 to present.
Self-Annotated Reddit Corpus (SARC)
Dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. It contains statements, along with their responses as well as many non-sarcastic comments from the same source.
Yoruba Text
Multiple datasets scraped together for the Yoruba language.
Igbo Text
Dataset is a parallel dataset for the Urhobo language.
Urhobo Text
Dataset is a parallel dataset containing 10.3M tokens.
WikiText-TL-39
Dataset is a large scale, unlabeled text dataset with 39M tokens in the training set.
Statutory Reasoning Assessment (SARA)
Dataset contains a set of rules extracted from the statutes of the US Internal Revenue Code (IRC), together with a set of natural language questions which may only be answered correctly by referring to the rules.
Arabic in Business and Management Corpora (ABMC)
Dataset contains 400 statements by chairmen and chief executives of Arab companies, 400 Arabic economic news articles, and 400 Arabic stock market news articles.
Polish Parliamentary Corpus (PPC)
Dataset is a collection of linguistically analysed documents from the proceedings of Polish Parliament, Sejm and Senate. It is based on the Polish Sejm Corpus.
All the News 2.0
Dataset contains 2.7 million articles from 26 different publications from January 2016 to April 1, 2020.
Leipzig Corpora Collection
Dataset containing 252 languages of web crawled news corpora.
BuGL
Dataset consists of 54 GitHub projects in four different programming languages (C, C++, Java and Python) with around 10,187 issues.
HJDataset
Dataset contains over 250,000 layout element annotations of seven types in Japanese documents.
PoKi
Dataset is a corpus of 61,330 poems written by children from grades 1 to 12.
NELA-GT-2019
Dataset contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Included are source-level ground truth labels from 7 different assessment sites.
ArabicWeb16
Dataset contains 150,211,934 Arabic Web pages with high coverage of dialectal Arabic as well as Modern Standard Arabic (MSA).
1.5 billion Words Arabic Corpus
The data were collected from newspaper articles in ten major news sources from eight Arab countries, over a period of fourteen years.
COVID-19 Open Research Dataset (CORD-19)
Dataset contains 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.
Arabic Speech Corpus
Dataset was recorded in South Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized using this corpus has produced a high-quality, natural voice.
Khaleej-2004 Corpus
Dataset contains more than 5,000 articles, corresponding to nearly 3 million words, across 4 topics: International News, Local News, Economy, and Sports.
Watan-2004 Corpus
Dataset contains about 20,000 articles covering 6 topics: culture, religion, economy, local news, international news, and sports.
Parallel Arabic Dialectal Corpus (PADIC)
Dataset is a multi-dialectal corpus that contains six dialects in addition to MSA, in Buckwalter format.
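For readers unfamiliar with Buckwalter format: it is an ASCII transliteration in which each Arabic letter is written as one Latin character. The sketch below converts consonant-only Buckwalter text back to Arabic script. The table covers only the basic letters (no diacritics or hamza carriers), the function name is illustrative, and PADIC may use a variant of the scheme, so treat this as an assumption-laden example rather than the corpus's own tooling.

```python
# Partial Buckwalter-to-Arabic table: basic letters only (assumption:
# diacritics and hamza-carrier forms are omitted for brevity).
BUCKWALTER_TO_ARABIC = {
    "'": "ء", "A": "ا", "b": "ب", "t": "ت", "v": "ث", "j": "ج",
    "H": "ح", "x": "خ", "d": "د", "*": "ذ", "r": "ر", "z": "ز",
    "s": "س", "$": "ش", "S": "ص", "D": "ض", "T": "ط", "Z": "ظ",
    "E": "ع", "g": "غ", "f": "ف", "q": "ق", "k": "ك", "l": "ل",
    "m": "م", "n": "ن", "h": "ه", "w": "و", "y": "ي", "p": "ة",
    "Y": "ى",
}


def buckwalter_to_arabic(text):
    """Map each Buckwalter character to Arabic script; unknown characters pass through."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


# "ktAb" is the Buckwalter spelling of the Arabic word for "book".
print(buckwalter_to_arabic("ktAb"))
```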
Multilingual Corpus of Sentence-Aligned Spoken Utterances (MaSS)
Dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). Languages: Basque, English, Finnish, French, Hungarian, Romanian, Russian, Spanish.
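The figure of 56 language pairs follows from counting ordered (source, target) pairs over 8 languages: 8 × 7 = 56. A minimal sketch of that arithmetic, assuming direction matters (unordered pairs would give 28):

```python
from itertools import permutations

# The eight languages named in the MaSS description.
LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]

# Ordered (source, target) pairs, so each direction is counted separately.
language_pairs = list(permutations(LANGUAGES, 2))

print(len(language_pairs))
```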
Wikipedia News Corpus
Text from Wikipedia's current events page with dates.
Curation Corpus
Dataset is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves.
DOGC
A collection of documents from the official journal of the Catalan Government in Catalan and Spanish.
ECB Corpus
Website and documentation from the European Central Bank. Contains 19 languages.
Eubookshop
Corpus of documents from the EU bookshop. Contains 48 languages.
Finlex
Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish.
Coarse Discourse
Dataset contains discourse annotations and relations on threads from Reddit during 2016. Requires merging using Reddit API.
PG-19
Dataset contains a set of books extracted from the Project Gutenberg library that were published before 1919. It also contains metadata of book titles and publication dates.
Wikipedia
The 2016-12-21 dump of English Wikipedia.
Customer Interaction Data of German Emails and Online Requests
Dataset is used to evaluate the task of automatically categorizing German customer requests. The dataset consists of a set of emails and online requests sent to the support center of a multimedia software company.
Groningen Meaning Bank
Dataset contains texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and discourse representation structures compatible with first-order logic.
Kensho Derived Wikimedia Dataset (KDWD)
Dataset contains two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base.
Parallel Meaning Bank
Dataset contains sentences and texts in raw and tokenised format, syntactic analysis, word senses, thematic roles, reference resolution, and formal meaning representations. The annotated parallel corpus includes English, German, Dutch, and Italian.
Open Super-Large Crawled Almanach Corpus (OSCAR)
Multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 166 different languages are available.
Classify Emotional Relationships of Fictional Characters
Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters.
Event-focused Emotion Corpora for German and English
German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources.
Portuguese Newswire Corpus
Dataset contains newswire articles collected between 1994 and 2016. Requires preprocessing of the HTML pages, which are found in the GitHub repository at the download link.
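Preprocessing raw HTML pages usually starts with stripping markup to plain text. The sketch below does this with Python's standard-library HTMLParser. The class and function names are illustrative, and the corpus's actual page layout is not described here, so real pages may need extra rules (for example, dropping navigation menus).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<html><body><h1>Headline</h1><p>First paragraph.</p></body></html>"
print(html_to_text(sample))
```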
ABC Australia News Corpus
Entire news corpus of ABC Australia from 2003 to 2019.
arXiv Bulk Data
A collection of research papers on arXiv.
CommonCrawl
Dataset contains data from 25 billion web pages.
Cornell Newsroom
Dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017.
Enron Email Dataset
Emails from employees at Enron organized into folders.
European Parliament Proceedings (Europarl)
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages.
Gutenberg Book Corpus
Dataset contains 60,000 eBooks.
Hansards Canadian Parliament
Dataset contains pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament.
Harvard Library
Dataset contains books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
Historical Newspapers Daily Word Time Series Dataset
Dataset contains daily contents of newspapers published in the US and UK from 1836 to 1922.
News Headlines Of India
Dataset contains an archive of notable events in India during 2001-2018, recorded by the Times of India.
NLP Chinese Corpus
Large text corpora in Chinese.
One Week of Global News Feeds
Dataset contains most of the new news content published online over one week in 2017 and 2018.
Open Research Corpus
Dataset contains over 39 million published research papers in computer science, neuroscience, and biomedicine.
OpenWebTextCorpus
Dataset contains the text of millions of web pages linked from Reddit URLs, totalling 38 GB of text data.
Plaintext Jokes
Dataset contains 208,000 jokes scraped from three sources.
Reddit All Comments Corpus
All Reddit comments (as of 2017).
Saudi Newspapers Corpus
Dataset contains 31,030 Arabic newspaper articles.
Stack Overflow BigQuery Dataset
BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges.
Ubuntu Dialogue Corpus
Dialogues extracted from Ubuntu chat stream on IRC.
WikiHow
Dataset contains article and summary pairs extracted and constructed from an online knowledge base written by different human authors.
WikiLinks
Dataset contains 40 million mentions over 3 million entities based on hyperlinks from Wikipedia.
WMT 19 Multiple Datasets
Multiple text corpora in multiple languages.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 397M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 334M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.1G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 178M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 393M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.9G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 13G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.8M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 11M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 56M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 90M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 12G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 107M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 13G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.5M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.8M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 16G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 44M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 452M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 143K.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 67M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 6.1G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.8G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 78M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.3G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.5G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 86K.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 21G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 332M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.3G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 68M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 536M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 79M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 8.7G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 701M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 8.0M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.4G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 884M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 141M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 155M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 28G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 3.6M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 25M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 51M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.1M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.3G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.3M.
The BrWaC (Brazilian Portuguese Web as Corpus)
Dataset is a large corpus of Brazilian Portuguese web text constructed following the WaCky framework and made public for research purposes.
BlogSet-BR
This dataset is a collection of blog posts crawled from the Blogspot platform, containing texts by Brazilian authors.
Datasets of Neuropsychological Language Tests in Brazilian Portuguese (DNLT-BP)
This dataset contains data collected from participants in clinical or academic studies who read and signed the Informed Consent Form. The research was evaluated and approved by the Research Ethics Committees of the institutions involved.
Historical Portuguese Corpora (HPC)
Dataset is a sub-project of the Historical Dictionary of Brazilian Portuguese project, which is funded by CNPq, Brazil. The HPC project develops tools and resources for manipulating historical corpora and managing historical dictionaries. The tools and resources were released into the public domain.
Lex2Kids
Dataset contains a lexical representation of the Portuguese most often heard by children. It contains 36,413 subtitles of films and series in the Family and Animation genres.
ITD - Dataset de Acordãos do STF de 2010 a 2018
The Iudicium Textum Dataset (ITD) contains the texts extracted from the rulings (Acórdãos) of the Supremo Tribunal Federal, Brazil's Supreme Federal Court, from 2010 to 2018. The texts are separated by section, with the votes and reports identified by author (justice). The original text is also kept in full, and the parties involved are, for the most part, identified. The data are organized in a JSON file that can be imported into a MongoDB database. Along with the dataset, the original PDF files are also available, as well as the tools and code used to download, extract, and convert the data that make up the dataset.
CC100-Belarusian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 692M.
CC100-Bulgarian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 9.3G.
CC100-Bengali
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 860M.
CC100-Bengali Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 164M.
CC100-Breton
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 21M.
CC100-Bosnian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 18M.
CC100-Catalan
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.4G.
CC100-Czech
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.4G.
CC100-Welsh
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 179M.
CC100-Danish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 12G.
CC100-German
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 18G.
CC100-Greek
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.4G.
CC100-English
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 82G.
CC100-Esperanto
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 250M.
CC100-Spanish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Estonian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.7G.
CC100-Basque
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 488M.
CC100-Persian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 20G.
CC100-Fulah
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 3.1M.
CC100-Finnish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.
CC100-French
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Frisian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 38M.
CC100-Irish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 108M.
CC100-Scottish Gaelic
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 22M.
CC100-Galician
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 708M.
CC100-Guarani
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.5M.
CC100-Gujarati
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 242M.
CC100-Hausa
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 61M.
CC100-Hebrew
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 6.1G.
CC100-Hindi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.5G.
CC100-Hindi Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 129M.
CC100-Croatian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.7G.
CC100-Haitian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 9.1M.
CC100-Hungarian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.
CC100-Armenian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 776M.
CC100-Indonesian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 36G.
CC100-Igbo
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 6.6M.
CC100-Icelandic
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 779M.
CC100-Italian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.8G.
CC100-Japanese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.
CC100-Javanese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 37M.
CC100-Georgian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.1G.
CC100-Kazakh
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 889M.
CC100-Khmer
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 153M.
CC100-Kannada
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 360M.
CC100-Korean
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Kurdish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 90M.
CC100-Kyrgyz
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 173M.
CC100-Latin
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 609M.
CC100-Ganda
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.3M.
CC100-Limburgish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.2M.
CC100-Lingala
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.3M.
CC100-Lao
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 63M.
CC100-Lithuanian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 3.4G.
CC100-Latvian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.1G.
CC100-Malagasy
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 29M.
CC100-Macedonian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 706M.
CC100-Malayalam
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 831M.
CC100-Mongolian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 397M.
CC100-Marathi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 334M.
CC100-Malay
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.1G.
CC100-Burmese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46M.
CC100-Burmese (Zawgyi)
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 178M.
CC100-Nepali
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 393M.
CC100-Dutch
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 7.9G.
CC100-Norwegian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 13G.
CC100-Northern Sotho
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.8M.
CC100-Oromo
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 11M.
CC100-Oriya
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 56M.
CC100-Punjabi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 90M.
CC100-Polish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 12G.
CC100-Pashto
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 107M.
CC100-Portuguese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 13G.
CC100-Quechua
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.5M.
CC100-Romansh
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.8M.
CC100-Romanian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 16G.
CC100-Russian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46G.
CC100-Sanskrit
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 44M.
CC100-Sinhala
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 452M.
CC100-Sardinian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 143K.
CC100-Sindhi
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 67M.
CC100-Slovak
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 6.1G.
CC100-Slovenian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 2.8G.
CC100-Somali
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 78M.
CC100-Albanian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.3G.
CC100-Serbian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.5G.
CC100-Swati
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 86K.
CC100-Sundanese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15M.
CC100-Swedish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 21G.
CC100-Swahili
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 332M.
CC100-Tamil
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.3G.
CC100-Tamil Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 68M.
CC100-Telugu
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 536M.
CC100-Telugu Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 79M.
CC100-Thai
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 8.7G.
CC100-Tagalog
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 701M.
CC100-Tswana
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 8.0M.
CC100-Turkish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.4G.
CC100-Uyghur
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46M.
CC100-Ukrainian
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Urdu
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 884M.
CC100-Urdu Romanized
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 141M.
CC100-Uzbek
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 155M.
CC100-Vietnamese
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 28G.
CC100-Wolof
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 3.6M.
CC100-Xhosa
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 25M.
CC100-Yiddish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 51M.
CC100-Yoruba
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.1M.
CC100-Chinese (Simplified)
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100-Chinese (Traditional)
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.3G.
CC100-Zulu
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.3M.
Elsevier OA CC-BY
Dataset contains 40,091 open access (OA) CC-BY articles from across Elsevier’s journals.
CC Net
Dataset of the Common Crawl corpus that has been cleaned and deduplicated. This pipeline preserves the structure of documents and filters the data based on their distance to Wikipedia.
SFU Opinion and Comments Corpus (SOCC)
Dataset contains 10,339 opinion articles (editorials, columns, and op-eds) together with their 663,173 comments from 303,665 comment threads, from the main Canadian daily in English, The Globe and Mail, from January 2012 to December 2016. In addition, an annotated subset of 1,043 comments, posted in response to 10 different articles, is labeled for toxicity, negation and its scope, and appraisal. The articles cover a variety of subjects: technology, immigration, terrorism, politics, budget, social issues, religion, property, and refugees.
The Semantic Scholar Open Research Corpus (S2ORC)
Dataset contains 136M+ paper nodes with 12.7M+ full text papers and connected by 467M+ citation edges.
ACL Anthology Reference Corpus (ACL ARC)
Dataset contains 10,921 articles from the February 2007 snapshot of the Anthology; text and metadata for the articles were extracted, consisting of BibTeX records derived either from the headers of each paper or from metadata taken from the Anthology website.
UIT-SPC
Dataset contains 1,565 papers from top NLP/CL conferences such as ACL, CoNLL, EACL, NAACL and EMNLP. They are pre-processed by removing unnecessary information (e.g., formulas, tables, etc.). Then, they are formatted as .xml files that include the paper title, sections, and sub-sections according to the paper's structure. [requires contacting author for corpus]
Aesthetics Text Corpus
Dataset consists of novels and short stories written in Hindi language. Novels and stories were scraped from http://hindisamay.com, http://premchand.co.in, a website dedicated to the popular novelist Premchand’s stories, and Bhandarkar Oriental Research Institute’s Digital Library (http://borilib.com). As a preprocessing step, the text was split into sentences and special characters, English tokens and Latin numbers were deleted.
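The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the corpus authors' actual script: the function name clean_hindi, the choice of sentence terminators (danda, double danda, ? and !), and the exact Unicode ranges kept are all assumptions made for this example.

```python
import re

# Assumption: a sentence ends at a danda (U+0964), double danda (U+0965), '?' or '!'.
SENTENCE_END = re.compile(r"[\u0964\u0965?!]+")

# Assumption: keep only Devanagari letters, signs and digits plus whitespace.
# Everything else (Latin letters, ASCII digits, punctuation) is replaced by a space.
NOT_KEPT = re.compile(r"[^\u0900-\u0963\u0966-\u096F\u0971-\u097F\s]")


def clean_hindi(text):
    """Split text into sentences and drop non-Devanagari characters."""
    sentences = []
    for raw in SENTENCE_END.split(text):
        kept = NOT_KEPT.sub(" ", raw)
        kept = " ".join(kept.split())  # collapse runs of whitespace
        if kept:  # skip sentences that became empty after cleaning
            sentences.append(kept)
    return sentences
```

Sentences that consist only of English tokens or digits disappear entirely, which may or may not match what the original corpus does.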
Hippocorpus
Dataset of 6,854 English diary-like short stories about recalled and imagined events.
EBM PICO
Dataset contains ~5,000 medical abstracts describing clinical trials, annotated in detail with respect to characteristics of the underlying trial Populations (e.g., diabetics), Interventions (insulin), Comparators (placebo) and Outcomes (blood glucose levels).
FT Speech
Dataset contains recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers.
ArxivPapers
Dataset is a corpus of over 100,000 scientific papers related to machine learning.
CodeSearchNet Corpus
Dataset contains functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub.
NIPS Papers
Dataset contains the title, authors, abstracts, and extracted text for all NIPS papers from 1987 to 2016.
Ljspeech
Dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
VoxForge
Dataset consisting of speech audio clips submitted by the community involving several different languages. Dataset is constantly updated.
Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong
Dataset contains aligned sentence pairs from bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislation and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others.
COVID-19 Twitter Chatter Dataset
Dataset contains over 152 million tweets, growing daily, related to COVID-19 chatter generated from January 1st, 2020 to present.
Self-Annotated Reddit Corpus (SARC)
Dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. It contains statements, along with their responses as well as many non-sarcastic comments from the same source.
Yoruba Text
Multiple datasets scraped together for the Yoruba language.
Igbo Text
Dataset is a parallel dataset for the Urhobo language.
Urhobo Text
Dataset is a parallel dataset containing 10.3M tokens.
WikiText-TL-39
Dataset is a large scale, unlabeled text dataset with 39M tokens in the training set.
Statutory Reasoning Assessment (SARA)
Dataset contains a set of rules extracted from the statutes of the US Internal Revenue Code (IRC), together with a set of natural language questions which may only be answered correctly by referring to the rules.
Arabic in Business and Management Corpora (ABMC)
Dataset contains 400 statements by chairmen and chief executives of Arab companies, 400 Arabic economic news articles, and 400 Arabic stock market news articles.
Polish Parliamentary Corpus (PPC)
Dataset is a collection of linguistically analysed documents from the proceedings of Polish Parliament, Sejm and Senate. It is based on the Polish Sejm Corpus.
All the News 2.0
Dataset contains 2.7 million articles from 26 different publications from January 2016 to April 1, 2020.
Leipzig Corpora Collection
Dataset containing 252 languages of web crawled news corpora.
BuGL
Dataset consists of 54 GitHub projects in four different programming languages, namely C, C++, Java and Python, with around 10,187 issues.
HJDataset
Dataset contains over 250,000 layout element annotations of seven types in Japanese documents.
PoKi
Dataset is a corpus of 61,330 poems written by children from grades 1 to 12.
NELA-GT-2019
Dataset contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Included are source-level ground truth labels from 7 different assessment sites.
ArabicWeb16
Dataset contains 150,211,934 Arabic Web pages with high coverage of dialectal Arabic as well as Modern Standard Arabic (MSA).
1.5 billion Words Arabic Corpus
The data were collected from newspaper articles in ten major news sources from eight Arabic countries, over a period of fourteen years.
COVID-19 Open Research Dataset (CORD-19)
Dataset contains 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.
Arabic Speech Corpus
Dataset was recorded in south Levantine Arabic (Damascene accent) using a professional studio. Speech synthesized from this corpus has produced a high-quality, natural voice.
Khaleej-2004 Corpus
Dataset contains more than 5,000 articles which correspond to nearly 3 million words across 4 topics: International News, Local News, Economy, and Sports.
Watan-2004 Corpus
Dataset contains about 20,000 articles talking about 6 topics: culture, religion, economy, local news, international news and sports.
Parallel Arabic Dialectal Corpus (PADIC)
Dataset is a multi-dialectal corpus - contains six dialects in addition to MSA in Buckwalter format.
Multilingual Corpus of Sentence-Aligned Spoken Utterances (MaSS)
Dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). Languages: Basque, English, Finnish, French, Hungarian, Romanian, Russian, Spanish.
Wikipedia News Corpus
Text from Wikipedia's current events page with dates.
Curation Corpus
Dataset is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves.
DOGC
A collection of documents from the official journal of the Catalan Government in Catalan and Spanish.
ECB Corpus
Website and documentation from the European Central Bank. Contains 19 languages.
Eubookshop
Corpus of documents from the EU bookshop. Contains 48 languages.
Finlex
Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish.
Coarse Discourse
Dataset contains discourse annotations and relations on threads from Reddit during 2016. Requires merging using Reddit API.
PG-19
Dataset contains a set of books extracted from the Project Gutenberg books library that were published before 1919. It also contains metadata of book titles and publication dates.
Wikipedia
The 2016-12-21 dump of English Wikipedia.
Customer Interaction Data of German Emails and Online Requests
Dataset is used to evaluate the task of automatically categorizing German customer requests. The dataset consists of a set of emails and online requests sent to the support center of a multimedia software company.
Groningen Meaning Bank
Dataset contains texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and discourse representation structures compatible with first-order logic.
Kensho Derived Wikimedia Dataset (KDWD)
Dataset contains two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base.
Parallel Meaning Bank
Dataset contains sentences and texts in raw and tokenised format, syntactic analysis, word senses, thematic roles, reference resolution, and formal meaning representations. The annotated parallel corpus includes English, German, Dutch and Italian.
Open Super-Large Crawled Almanach Corpus (OSCAR)
Multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 166 different languages are available.
Classify Emotional Relationships of Fictional Characters
Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters.
Event-focused Emotion Corpora for German and English
German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources.
Portuguese Newswire Corpus
Dataset contains x number of newswire articles collected between 1994 and 2016. Requires preprocessing of the HTML pages, which are available on GitHub via the download link.
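For a corpus distributed as raw HTML pages, like the one above, a first preprocessing pass might look like the sketch below. It uses only Python's standard html.parser module; the class name TextExtractor, the helper html_to_text, and the decision to skip script and style content are assumptions for illustration, not part of the dataset's own tooling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &iacute;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Real newswire pages usually also carry navigation menus and footers, so a production pipeline would likely need site-specific rules on top of this.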
ABC Australia News Corpus
Entire news corpus of ABC Australia from 2003 to 2019.
arXiv Bulk Data
A collection of research papers on arXiv.
CommonCrawl
Dataset contains data from 25 billion web pages.
Cornell Newsroom
Dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017.
Enron Email Dataset
Emails from employees at Enron organized into folders.
European Parliament Proceedings (Europarl)
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages.
Gutenberg Book Corpus
Dataset contains 60,000 eBooks.
Hansards Canadian Parliament
Dataset contains pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament.
Harvard Library
Dataset contains books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
Historical Newspapers Daily Word Time Series Dataset
Dataset contains daily contents of newspapers published in the US and UK from 1836 to 1922.
News Headlines Of India
Dataset contains an archive of notable events in India during 2001-2018, recorded by the Times of India.
NLP Chinese Corpus
Large text corpora in Chinese.
One Week of Global News Feeds
Dataset contains most of the new news content published online over one week in 2017 and 2018.
Open Research Corpus
Dataset contains over 39 million published research papers in Computer Science, Neuroscience, and Biomedicine.
OpenWebTextCorpus
Dataset contains text from millions of webpages linked from Reddit URLs, totalling 38GB of text data.
Plaintext Jokes
Dataset contains 208,000 jokes scraped from three sources.
Reddit All Comments Corpus
All Reddit comments (as of 2017).
Saudi Newspapers Corpus
Dataset contains 31,030 Arabic newspaper articles.
Stack Overflow BigQuery Dataset
BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges.
Ubuntu Dialogue Corpus
Dialogues extracted from Ubuntu chat stream on IRC.
WikiHow
Dataset contains article and summary pairs extracted and constructed from an online knowledge base written by different human authors.
WikiLinks
Dataset contains 40 million mentions over 3 million entities based on hyperlinks from Wikipedia.
WMT 19 Multiple Datasets
Multiple text corpora in multiple languages.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 21G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 332M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.3G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 68M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 536M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 79M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 8.7G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 701M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 8.0M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.4G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 46M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 884M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 141M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 155M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 28G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 3.6M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 25M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 51M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 1.1M.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 14G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 5.3G.
CC100
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 4.3M.
The BrWaC (Brazilian Portuguese Web as Corpus)
This dataset is a large Brazilian Portuguese web corpus built following the WaCky framework and released publicly for research purposes.
BlogSet-BR
This dataset is a collection of blog posts crawled from the Blogspot platform, containing texts by Brazilian authors.
Datasets of Neuropsychological Language Tests in Brazilian Portuguese (DNLT-BP)
This dataset contains data collected from participants in clinical or academic studies who read and signed an Informed Consent Form. The research was evaluated and approved by the Research Ethics Committees of the associated institutions.
Historical Portuguese Corpora (HPC)
HPC is a sub-project of the Historical Dictionary of Brazilian Portuguese project, which is funded by CNPq, Brazil. The HPC project develops tools and resources for manipulating historical corpora and managing historical dictionaries. The tools and resources were released into the public domain.
Lex2Kids
This dataset contains a lexical representation of the Portuguese most often heard by children. It contains 36,413 subtitles from films and series in the Family and Animation genres.
ITD - Dataset de Acordãos do STF de 2010 a 2018
The Iudicium Textum Dataset (ITD) contains the texts extracted from the rulings (Acórdãos) of the Brazilian Supreme Federal Court (Supremo Tribunal Federal, STF) from 2010 to 2018. The texts are split by section, with the votes and reports identified by author (justice). The original text is also kept in full, and the parties involved are, for the most part, identified. The data are organized in a JSON file that can be imported into a MongoDB database. The original PDF files are available alongside the dataset, as are the tools and code used to download, extract, and convert the data that make up the dataset.
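Since ITD ships as a JSON file, a quick first step is to load it and see which fields each record carries. The sketch below is only an illustration under assumptions: the file name `itd.json` and the helpers `load_records` and `field_counts` are hypothetical, and the actual field names are not documented here, so the code discovers them instead of assuming them. It accepts either a single JSON array or one JSON object per line.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Load records from a JSON file (hypothetical helper).

    Accepts either one JSON document (an array or a single object)
    or one JSON object per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: try one object per line instead.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def field_counts(records):
    """Count how often each top-level key appears across the records."""
    return Counter(
        key for rec in records if isinstance(rec, dict) for key in rec
    )


if __name__ == "__main__":
    # "itd.json" is an assumed file name, not one given by the dataset page.
    records = load_records("itd.json")
    print(f"{len(records)} records")
    for key, count in field_counts(records).most_common():
        print(f"{key}: present in {count} records")
```

Inspecting the key counts first shows how the sections, votes, and reports are actually named in the file before any further processing or a MongoDB import.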