Classify and extract text 10x better and faster 🦾



Books Corpus Dataset

Created by Tiedemann in 2012, the Books Corpus dataset is a collection of copyright-free books. The multilingual corpus covers 16 languages and contains 0.91M sentence fragments and 19.50M tokens. It is distributed in XCES (XML) file format.

Dataset Sources

Here you can download the Books Corpus dataset in XCES (XML) format.

Download Books Corpus dataset XCES, XML files
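
For readers who have not worked with XCES files before, here is a minimal sketch of reading sentence-alignment links with Python's standard library. The sample document, the file names in it, and the read_links helper are all hypothetical and only illustrate the general shape of an XCES alignment file (a cesAlign root, linkGrp groups, and link elements whose xtargets attribute lists source and target sentence IDs separated by a semicolon). The actual downloaded files may differ, so check them before relying on this layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of an XCES sentence-alignment file.
# The document names and sentence IDs are made up for illustration.
SAMPLE = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/book.xml" toDoc="fr/book.xml">
    <link id="SL1" xtargets="s1;s1" />
    <link id="SL2" xtargets="s2 s3;s2" />
    <link id="SL3" xtargets=";s3" />
  </linkGrp>
</cesAlign>
"""


def read_links(xml_text):
    """Return (from_doc, to_doc, source_ids, target_ids) for each link."""
    root = ET.fromstring(xml_text)
    pairs = []
    for group in root.iter("linkGrp"):
        from_doc = group.get("fromDoc")
        to_doc = group.get("toDoc")
        for link in group.iter("link"):
            # xtargets holds "source ids;target ids"; either side may be empty.
            src, _, tgt = link.get("xtargets", "").partition(";")
            pairs.append((from_doc, to_doc, src.split(), tgt.split()))
    return pairs


if __name__ == "__main__":
    for from_doc, to_doc, src_ids, tgt_ids in read_links(SAMPLE):
        print(from_doc, src_ids, "->", to_doc, tgt_ids)
```

Each link maps zero or more source sentences to zero or more target sentences, so the helper keeps both sides as lists rather than assuming a one-to-one alignment.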

Fine-tune with Books Corpus dataset

Metatext is a no-code tool for training, tuning, and integrating custom NLP models.


Paper

Read the full original Books Corpus paper.

Download PDF paper



Metatext helps you classify and extract information from text and documents using language models customized with your data and expertise.