List of French Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for lower-resource languages such as French can be hard to find, but there are enough good ones to start your machine learning (ML) project right now. To make them easier to find, we collected a list of French NLP datasets for machine learning, a large curated base of training and testing data. It covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.





Let’s get started!

ASAYAR
This dataset is used for extracting text information from traffic panels. It consists of 3 sub-datasets: Arabic-Latin scene text localization, traffic sign detection, and directional symbol detection. The dataset contains 1,763 images collected on different Moroccan highways and manually annotated using 16 object categories. The fully annotated ASAYAR images contain more than 20,000 bounding boxes. [requires form completion]
CC100-French
This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots in the CC-Net repository. The size of this corpus is 14G.
DiaBLa
Parallel dataset of spontaneous, written, bilingual dialogues for the evaluation of Machine Translation, annotated for human judgments of translation quality.
CASS
This dataset is composed of decisions made by the French Court of Cassation and summaries of these decisions written by lawyers.
Translation-Augmented-LibriSpeech-Corpus (Libri-Trans)
This dataset is an augmentation of LibriSpeech ASR and contains English utterances (from audiobooks) automatically aligned with French text. It offers about 236 hours of speech aligned to translated text.
FQuAD
Dataset contains 25,000+ questions on a set of Wikipedia articles, modeled after SQuAD.
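
Since FQuAD is described as modeled after SQuAD, a question-answering file in that style can probably be walked with a few nested loops. The sketch below assumes the SQuAD v1.1 JSON layout (`data` > `paragraphs` > `qas` > `answers`), which the FQuAD release may or may not follow exactly. The `QAExample` type, the function names, and the tiny sample are hypothetical and written for illustration; they are not real FQuAD data.

```python
import json
from typing import Iterator, NamedTuple


class QAExample(NamedTuple):
    """One (question, answer) pair with its source paragraph."""
    question: str
    context: str
    answer_text: str
    answer_start: int  # character offset of the answer inside context


def iter_squad_examples(squad: dict) -> Iterator[QAExample]:
    """Yield one example per answer in a SQuAD v1.1-style dictionary."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield QAExample(
                        question=qa["question"],
                        context=context,
                        answer_text=answer["text"],
                        answer_start=answer["answer_start"],
                    )


def load_squad_file(path: str) -> dict:
    """Read a SQuAD-style JSON file; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hand-written sample in the assumed layout (not taken from FQuAD).
SAMPLE = {
    "data": [
        {
            "title": "Paris",
            "paragraphs": [
                {
                    "context": "Paris est la capitale de la France.",
                    "qas": [
                        {
                            "id": "sample-1",
                            "question": "Quelle est la capitale de la France ?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ]
}

if __name__ == "__main__":
    for example in iter_squad_examples(SAMPLE):
        print(example.question, "->", example.answer_text)
```

Flattening the nested structure into one record per answer keeps later steps, such as tokenization or train/test splitting, independent of the file layout. Checking that `answer_start` really points at `answer_text` inside `context` is a cheap sanity test before training.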
