Chinese Datasets

We catalog 58 Chinese datasets for NLP and machine learning. Browse the list below or narrow down by task.

This page covers Chinese (Mandarin), the language with the most native speakers in the world and a major focus of multilingual NLP.

Updated June 2026

What tasks do Chinese datasets cover?

Datasets in other languages

Frequently asked questions