TigerResearch/pretrain_zh
General NLP · Chinese · Benchmark
The TigerResearch/pretrain_zh dataset is a Chinese general NLP resource released by TigerResearch in 2023.
This dataset is used as an LLM benchmark.
About TigerResearch/pretrain_zh
Dataset Card for "pretrain_zh"
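The Chinese portion of the TigerBot pretraining data.
It contains (sizes before compression) 12G of Chinese books (zh-books), 25G of Chinese web text (zh-webtext), and 19G of Chinese encyclopedia text (zh-wiki).
For more corpora, follow the open-source model and its ongoing updates: https://github.com/TigerResearch/TigerBot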
Usage
import datasets
ds_sft = datasets.load_dat...
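The usage snippet above is cut off in this copy. As a rough sketch only, and not necessarily the call the dataset card shows, the corpus could be opened lazily with the Hugging Face `datasets` library so that the three subsets (12G + 25G + 19G, about 56G uncompressed) are not downloaded up front. The split name "train" and the helper names `stream_pretrain_zh` and `preview` are assumptions made for illustration.

```python
from itertools import islice


def stream_pretrain_zh(split="train"):
    """Open the corpus as a lazy stream instead of downloading it in full.

    Assumption: the repository exposes a split named "train".
    """
    # Imported here so that preview() below stays usable without the package.
    import datasets  # Hugging Face `datasets`

    return datasets.load_dataset(
        "TigerResearch/pretrain_zh",
        split=split,
        streaming=True,  # yields records on demand
    )


def preview(rows, n=3):
    """Return the first n records from any iterable of rows."""
    return list(islice(rows, n))
```

Calling `preview(stream_pretrain_zh())` would fetch only the first few records, which is usually enough to inspect the column names before committing to a full download.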
Details
- Task
- General NLP
- Language
- Chinese
- Format
- Parquet
- Rows / instances
- N/A
- Creator
- TigerResearch
- Year
- 2023