Tunisian Arabish Corpus (TArC)
Classification · Part-of-Speech (POS) · Tunisian · Benchmark
The Tunisian Arabish Corpus (TArC) is a Tunisian classification dataset published by Gugliotta et al. in 2020, comprising 4,79 examples.
This dataset is used as an LLM benchmark.
About Tunisian Arabish Corpus (TArC)
The dataset was extracted from social media and totals 43,313 tokens. The classification task consists of assigning each token to one of three classes: arabizi, foreign, or emotag.
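As a rough illustration of the token-level task, the sketch below reads tab-separated rows and tallies the three classes. The two-column `token<TAB>label` layout, the `count_labels` helper, and the sample rows are all assumptions made for this example; they are not taken from the TArC files, whose actual column layout may differ.

```python
import csv
import io
from collections import Counter

# The three token-level classes named in the dataset description.
CLASSES = {"arabizi", "foreign", "emotag"}

# Made-up sample rows, NOT taken from TArC. The two-column
# token<TAB>label layout is an assumption about the TSV file.
SAMPLE_TSV = "3lech\tarabizi\nmerci\tforeign\n:)\temotag\nbarcha\tarabizi\n"


def count_labels(tsv_text):
    """Count class labels in token<TAB>label rows, rejecting unknown labels."""
    counts = Counter()
    # QUOTE_NONE keeps quote characters in social-media text as literal data.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        token, label = row
        if label not in CLASSES:
            raise ValueError(f"unexpected label {label!r} for token {token!r}")
        counts[label] += 1
    return counts


label_counts = count_labels(SAMPLE_TSV)
print(label_counts)
```

To run this on a real file, the file contents would be passed in place of `SAMPLE_TSV`, after adjusting the column unpacking to the actual layout.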
Details
- Task: Classification, Part-of-Speech (POS)
- Language: Tunisian
- Format: TSV
- Rows / instances: 4,79
- Creator: Gugliotta et al.
- Year: 2020