List of Arabic Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for low-resource languages such as Arabic are hard to find, so we collected a list of Arabic NLP datasets for machine learning: a large curated base of training and testing data that you can use to start your machine learning (ML) project right now. The list covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Custom fine-tune with Arabic datasets

Metatext is a powerful no-code tool for training, tuning, and integrating custom NLP models


Found 48 Arabic Datasets

Let’s get started!

ASAYAR
Dataset is used for extracting text information from traffic panels. It consists of 3 sub-datasets: Arabic-Latin scene text localization, traffic sign detection, and directional symbol detection. The dataset contains 1,763 images collected on different Moroccan highways and annotated manually using 16 object categories. The fully annotated ASAYAR images contain more than 20,000 bounding box objects. [requires form completion]
Arabic Dataset for Commonsense Validation 
Dataset was translated from the original English dataset for commonsense validation (Wang et al., 2019). Each example in the provided dataset is composed of 2 sentences: {s1, s2} and a label indicating which one is invalid.
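As a minimal sketch of the record layout described above (two sentences plus a label marking the invalid one), a record could be held in code as follows. The class name, field names, and the 0/1 label encoding are assumptions made for illustration; the dataset's actual file format and label values are not stated here, and the sentences are English placeholders rather than real dataset content.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class CommonsenseExample:
    """One record: two sentences and a label marking the invalid one.

    Assumed encoding (hypothetical): invalid == 0 means s1 is invalid,
    invalid == 1 means s2 is invalid.
    """
    s1: str
    s2: str
    invalid: int

    def __post_init__(self) -> None:
        if self.invalid not in (0, 1):
            raise ValueError("invalid must be 0 (s1) or 1 (s2)")

    def invalid_sentence(self) -> str:
        return (self.s1, self.s2)[self.invalid]

    def valid_sentence(self) -> str:
        return (self.s1, self.s2)[1 - self.invalid]


def accuracy(examples: Sequence[CommonsenseExample], predictions: List[int]) -> float:
    """Fraction of examples where the predicted invalid index matches the label."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        return 0.0
    correct = sum(1 for ex, p in zip(examples, predictions) if ex.invalid == p)
    return correct / len(examples)


# Placeholder records, not taken from the dataset.
examples = [
    CommonsenseExample("placeholder sentence A", "placeholder sentence B", invalid=1),
    CommonsenseExample("placeholder sentence C", "placeholder sentence D", invalid=0),
]
score = accuracy(examples, [1, 1])
```

Keeping the label as an index into the sentence pair makes it straightforward to score a model that predicts which of the two sentences is invalid.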
CC100-Arabic
This dataset is one of the 100 corpora of monolingual data that were processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The size of this corpus is 5.4G.
Essex Arabic Summaries Corpus (EASC)
Dataset contains 153 Arabic articles and 765 human-generated extractive summaries of those articles. These summaries were generated using Mechanical Turk.
KALIMAT Multipurpose Arabic Corpus
Dataset contains 20,291 Arabic articles collected from the Omani newspaper Alwatan, together with extractive single-document and multi-document system summaries and named-entity-recognised versions of the articles. The data has 6 categories: culture, economy, local news, international news, religion, and sports.
Arabic in Business and Management Corpora (ABMC)
Dataset contains 400 chairman and chief executive statements from Arab companies, 400 Arabic economic news articles, and 400 Arabic stock market news articles.
Arabic Jordanian General Tweets (AJGT)
Dataset consists of 1,800 tweets, written in Modern Standard Arabic (MSA) or Jordanian dialect, annotated as positive or negative.
ArabicWeb16
Dataset contains 150,211,934 Arabic Web pages with high coverage of dialectal Arabic as well as Modern Standard Arabic (MSA).
Content-Based Categorized Dataset
Dataset contains 996 Web pages that were extracted from the ArabicWeb16 dataset and labeled.
ASTD: Arabic Sentiment Tweets Dataset
Dataset contains over 10k Arabic sentiment tweets classified into 4 classes: subjective positive, subjective negative, subjective mixed, and objective.
1.5 billion Words Arabic Corpus
The data were collected from newspaper articles in ten major news sources from eight Arab countries, over a period of fourteen years.
The Arabic Parallel Gender Corpus
Dataset is designed to support research on gender bias in natural language processing applications working on Arabic. Access requires submitting an application for approval.
Arabic Speech Corpus
Dataset was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized from this corpus has produced a high-quality, natural voice.
Khaleej-2004 Corpus
Dataset contains more than 5,000 articles, corresponding to nearly 3 million words, across 4 topics: International News, Local News, Economy, and Sports.
Watan-2004 Corpus
Dataset contains about 20,000 articles covering 6 topics: culture, religion, economy, local news, international news, and sports.
Parallel Arabic DIalectal Corpus (PADIC)
Dataset is a multi-dialectal corpus that contains six dialects in addition to MSA, in Buckwalter format.
Arabic Reading Comprehension Dataset (ARCD)
Dataset contains 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD) containing 48,344 questions.
Arabic Violence Twitter Corpus
Annotated Arabic tweets that mention a violent act. Tweets were classified into 8 classes: Crime, Accident, Crisis, Conflict, Human Rights Abuse, Violence, Opinion, or Other. Retrieval requires using the Twitter API to match IDs with tweets.
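Matching IDs with tweets amounts to looking up each annotated ID and joining the returned text with its label. The sketch below shows only that join. It makes no real API calls: `fetch_texts` is a hypothetical callable you would implement with whatever Twitter API client you have access to, and the batch size of 100 and the ID-to-label dictionary layout are likewise assumptions, not details of this corpus.

```python
from typing import Callable, Dict, List, Tuple


def attach_texts(
    labels: Dict[str, str],
    fetch_texts: Callable[[List[str]], Dict[str, str]],
    batch_size: int = 100,
) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Join tweet IDs and labels with tweet text.

    labels:      maps tweet ID -> class label (assumed layout).
    fetch_texts: hypothetical lookup; takes a batch of IDs and returns
                 a dict of ID -> text for the tweets it could retrieve.
    Returns (rows, missing): rows are (id, text, label) tuples, and
    missing lists IDs that could not be retrieved, e.g. deleted tweets.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = list(labels)
    rows: List[Tuple[str, str, str]] = []
    missing: List[str] = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        texts = fetch_texts(batch)
        for tweet_id in batch:
            if tweet_id in texts:
                rows.append((tweet_id, texts[tweet_id], labels[tweet_id]))
            else:
                missing.append(tweet_id)
    return rows, missing


# Stand-in lookup with placeholder data, used here instead of a real API client.
_FAKE_STORE = {"1": "placeholder text one", "3": "placeholder text three"}


def fake_fetch(batch: List[str]) -> Dict[str, str]:
    return {i: _FAKE_STORE[i] for i in batch if i in _FAKE_STORE}


rows, missing = attach_texts(
    {"1": "Crime", "2": "Accident", "3": "Opinion"}, fake_fetch, batch_size=2
)
```

Tracking the missing IDs separately matters in practice, since tweets that have been deleted or made private cannot be retrieved and the usable dataset may be smaller than the annotated ID list.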
Saudi Newspapers Corpus
Dataset contains 31,030 Arabic newspaper articles.
SemEvalCQA
Dataset for community question answering.
Twitter Dataset for Arabic Sentiment Analysis
Dataset contains Arabic tweets.

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using language models customized with your data and expertise.