Books Corpus Dataset
Created by Tiedemann in 2012, the Books Corpus dataset is a collection of copyright-free books. The corpus is multilingual: it covers 16 languages and contains 0.91M sentence fragments and 19.50M tokens. It is distributed in XCES, XML file format.
Dataset Sources
Here you can download the Books Corpus dataset in XCES, XML format.
Download Books Corpus dataset XCES, XML files
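Once downloaded, the XCES files can be read with any XML parser. The following is a minimal sketch in Python, not the official loader. It assumes the sentence-alignment layout commonly used for XCES files (a cesAlign root, linkGrp groups with fromDoc and toDoc attributes, and link elements whose xtargets attribute separates source and target sentence ids with a semicolon). The sample document and the read_links helper are hypothetical and only illustrate the idea; check the element names against the files you actually download.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed XCES alignment layout.
# It is made up for illustration and not taken from the corpus files.
SAMPLE = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/book.xml.gz" toDoc="fr/book.xml.gz">
    <link xtargets="s1;s1" />
    <link xtargets="s2 s3;s2" />
    <link xtargets="s4;" />
  </linkGrp>
</cesAlign>
"""


def read_links(xml_text):
    """Return (from_doc, source_ids, to_doc, target_ids) for every link."""
    root = ET.fromstring(xml_text)
    links = []
    for group in root.iter("linkGrp"):
        from_doc = group.get("fromDoc")
        to_doc = group.get("toDoc")
        for link in group.iter("link"):
            # "s2 s3;s2" means source sentences s2 and s3 align to target s2.
            source, _, target = link.get("xtargets", "").partition(";")
            links.append((from_doc, source.split(), to_doc, target.split()))
    return links


if __name__ == "__main__":
    for from_doc, source_ids, to_doc, target_ids in read_links(SAMPLE):
        print(from_doc, source_ids, "->", to_doc, target_ids)
```

An empty id list on either side marks a sentence with no counterpart in the other language, so downstream code should handle that case explicitly.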
Fine-tune with Books Corpus dataset
Metatext is a powerful no-code tool for training, tuning, and integrating custom NLP models.
Paper
Read the full original Books Corpus paper.