List of Chinese Datasets for Machine Learning Projects

High-quality datasets are key to good performance in natural language processing (NLP) projects. We collected a list of Chinese NLP datasets for machine learning: a large curated base of training and testing data covering a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Custom fine-tune with Chinese datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models
➡️  Try for free


Found 27 Chinese Datasets

Let’s get started!

MedDialog
Dataset contains conversations (in Chinese) between doctors and patients. It has 1.1 million dialogues and 4 million utterances.
Ant Financial Question Matching Corpus (AFQMC) (CLUE Benchmark)
Dataset poses a binary classification task: predict whether two sentences are semantically similar.
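
For a quick look at the sentence-pair format, here is a minimal sketch that loads AFQMC, assuming the community-maintained "clue" dataset on the Hugging Face Hub (field names follow its AFQMC config):

```python
# Hedged sketch: assumes the "clue" dataset with the "afqmc" config
# is available on the Hugging Face Hub.
from datasets import load_dataset

afqmc = load_dataset("clue", "afqmc")  # train / validation / test splits
example = afqmc["train"][0]
print(example["sentence1"])
print(example["sentence2"])
print(example["label"])  # 1 = semantically similar, 0 = not
```
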
TouTiao Text Classification for News Titles (TNEWS) (CLUE Benchmark)
Dataset consists of Chinese news published by TouTiao before May 2018, with a total of 73,360 titles. Each title is labeled with one of 15 news categories (finance, technology, sports, etc.) and the task is to predict which category the title belongs to.
IFLYTEK (CLUE Benchmark)
Dataset contains 17,332 annotated long texts of app descriptions, covering various application topics related to daily life. The task is to classify each description into one of 119 categories.
The Chinese Winograd Schema Challenge (CLUEWSC2020) (CLUE Benchmark)
Dataset is an anaphora/coreference resolution task where the model is asked to decide whether a pronoun and a noun (phrase) in a sentence co-refer. Data comes from 36 contemporary literary works in Chinese.
The Chinese Science and Technology Literature Data Set (CSL) (CLUE Benchmark)
Dataset is taken from the abstracts of Chinese papers and their keywords, selected from core journals of Chinese social sciences and natural sciences. TF-IDF is used to mix fake keywords with the papers' real keywords to construct abstract-keyword pairs; the task is to judge, based on the abstract, whether all of the keywords are real.
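
To make the construction concrete, here is a minimal sketch of the fake-keyword idea, not the benchmark's exact pipeline; the toy abstracts and keywords below are invented for illustration, and real Chinese text would first need a tokenizer such as jieba:

```python
# Sketch of CSL-style negative sampling: rank terms by TF-IDF and use
# high-scoring non-keywords as "fake" keywords.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, pre-tokenized (space-joined) abstracts -- invented examples.
abstracts = [
    "深度 学习 模型 在 图像 识别 任务 上 表现 良好",
    "知识 图谱 可以 增强 问答 系统 的 推理 能力",
]
real_keywords = [{"深度", "学习"}, {"知识", "图谱"}]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(abstracts)
vocab = vectorizer.get_feature_names_out()

pairs = []
for i, keywords in enumerate(real_keywords):
    scores = tfidf[i].toarray().ravel()
    # Highest-TF-IDF terms that are not real keywords become fakes.
    fakes = [vocab[j] for j in scores.argsort()[::-1]
             if vocab[j] not in keywords][:1]
    pairs.append((abstracts[i], sorted(keywords) + fakes, 0))  # 0: not all real
    pairs.append((abstracts[i], sorted(keywords), 1))          # 1: all real
print(pairs[0])
```
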
ChID (CLUE Benchmark)
Dataset is a Chinese idiom cloze test dataset containing 498,611 passages with 623,377 blanks, drawn from news, novels, and essays.
CC100-Chinese (Simplified)
This dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 14 GB.
CC100-Chinese (Traditional)
This dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 5.3 GB.
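
Both CC100 corpora are large raw-text dumps, so streaming is the practical way to sample them; a sketch assuming the "cc100" loader on the Hugging Face Hub:

```python
# Hedged sketch: assumes the "cc100" loader on the Hugging Face Hub;
# streaming avoids downloading the full multi-GB corpus.
from datasets import load_dataset

cc100_zh = load_dataset("cc100", lang="zh-Hans", split="train", streaming=True)
for record in cc100_zh.take(3):  # peek at a few lines of raw text
    print(record["text"][:80])
```
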
RiSAWOZ
Dataset contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning 12 domains.
LiveQA
Dataset was constructed from play-by-play live broadcasts. It contains 117k multiple-choice questions written by human commentators for over 1,670 NBA games, collected from the Chinese sports website Hupu.
MedDG
Dataset contains more than 17K conversations collected from an online health consultation community, relating to 12 types of common gastrointestinal diseases. Five categories of entities (diseases, symptoms, attributes, tests, and medicines) are annotated in each conversation as additional labels.
C3
Dataset is the first free-form multiple-choice Chinese machine reading comprehension dataset (C3), containing 13,369 documents (dialogues or more formally written mixed-genre texts) and 19,577 associated multiple-choice free-form questions collected from Chinese-as-a-second-language examinations.
DialogRE
Dataset is a human-annotated dialogue-based relation extraction dataset containing 1,788 dialogues from the complete transcripts of the American television sitcom "Friends". There are 36 possible relation types between an argument pair in a dialogue.
NCLS-Corpora
Contains two datasets for cross-lingual summarization: EN2ZHSUM and ZH2ENSUM, with 370,759 English-to-Chinese cross-lingual summarization (CLS) pairs and 1,699,713 Chinese-to-English CLS pairs, respectively.
Open-Domain Spoken Question Answering Dataset (ODSQA)
Dataset contains questions in both text and spoken forms, a multi-sentence spoken-form document, and a word-span answer drawn from the document.
NEJM-enzh
Dataset is an English-Chinese parallel corpus, consisting of about 100,000 sentence pairs and 3,000,000 tokens on each side, from the New England Journal of Medicine (NEJM).
Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong
Dataset contains aligned sentence pairs from bilingual texts covering the financial and legal domains in Hong Kong. Sources include government legislation and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents, and others.
LCSTS
Dataset is constructed from the Chinese microblogging website Sina Weibo. It consists of over 2 million real Chinese short texts, each with a short summary given by the text's author. Requires application.
MATINF
A labeled dataset for classification, question answering, and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions.
KdConv
Dataset is a Chinese multi-domain dataset, grounding the topics in multi-turn conversations to knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
Chinese Machine Reading Comprehension (CMRC)
Dataset (cloze style) contains over 100K blanks (questions) within over 10K passages originating from Chinese narrative stories.
CrossWOZ
Dataset is a cross-domain wizard-of-oz task-oriented dataset. It contains dialogue sessions and utterances for 5 domains: hotel, restaurant, attraction, metro, and taxi.
Chinese Machine Reading Comprehension (CMRC 2018)
Dataset is composed of nearly 20,000 real questions annotated on Wikipedia paragraphs by human experts.
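
CMRC 2018 follows the familiar span-extraction format; a sketch assuming the "cmrc2018" dataset on the Hugging Face Hub, which exposes SQuAD-style fields:

```python
# Hedged sketch: assumes the "cmrc2018" dataset on the Hugging Face Hub
# with SQuAD-style fields (context, question, answers).
from datasets import load_dataset

cmrc = load_dataset("cmrc2018")
sample = cmrc["train"][0]
print(sample["question"])
print(sample["answers"]["text"][0])  # annotated answer span
print(sample["context"][:100])       # opening of the source paragraph
```
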
Delta Reading Comprehension Dataset
Dataset contains 10,014 paragraphs from 2,108 Wikipedia entries and more than 30,000 questions drawn from the paragraphs.
NLP Chinese Corpus
Large text corpora in Chinese.
Tencent AI Lab Embedding Corpus
Dataset provides 200-dimensional vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases.
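
The released embeddings come as a word2vec-format text file, so they can be loaded with gensim; the filename below is illustrative, not an official path:

```python
# Hedged sketch: the filename is illustrative; point it at the
# downloaded word2vec-format text file.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "Tencent_AILab_ChineseEmbedding.txt", binary=False
)
print(wv.vector_size)                       # 200 dimensions
print(wv.most_similar("人工智能", topn=5))  # nearest neighbors by cosine similarity
```
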

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents with language models customized with your data and expertise.