
Text Corpora Datasets

Our directory catalogs 154 datasets for the text corpora task, 5 of which are benchmarks. Each entry links to its source, paper, and download. Browse the full list below or filter by language.

Updated June 2026

What languages do text corpora datasets cover?


Frequently asked questions