Tunisian Arabish Corpus (TArC)

Classification, Part-of-Speech (POS), Tunisian, Benchmark

The Tunisian Arabish Corpus (TArC) is a Tunisian classification dataset released by Gugliotta et al. in 2020, comprising 4,79 examples.

📊 This dataset is used as an LLM benchmark.

About Tunisian Arabish Corpus (TArC)

The dataset was extracted from social media and totals 43,313 tokens. The task is token-level classification: each token is assigned to one of three classes, arabizi, foreign, or emotag.
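As a rough illustration of the token-level task, the sketch below reads token/label pairs from a TSV stream and counts how many tokens fall into each class. The two-column layout (token, then label), the helper name `read_token_labels`, and the sample rows are assumptions made for this example; the actual TArC files may be organised differently.

```python
import csv
import io
from collections import Counter

# The three token-level classes named in the dataset description.
LABELS = {"arabizi", "foreign", "emotag"}


def read_token_labels(handle):
    """Yield (token, label) pairs from a two-column TSV stream.

    Assumption: one token per row, laid out as token<TAB>label.
    The real TArC file layout may differ.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        token, label = row[0], row[1]
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        yield token, label


# Hypothetical sample rows, invented for illustration only.
SAMPLE = "nheb\tarabizi\nweekend\tforeign\n:)\temotag\nbarcha\tarabizi\n"

pairs = list(read_token_labels(io.StringIO(SAMPLE)))
counts = Counter(label for _, label in pairs)
print(counts)
```

To run this against a real file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) in place of the `io.StringIO` sample, after adjusting the column positions to match the file.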

Details

Task: Classification, Part-of-Speech (POS)
Language: Tunisian
Format: TSV
Rows / instances: 4,79
Creator: Gugliotta et al.
Year: 2020
