Question 1

What is the Tunisian Arabish Corpus (TArC) dataset?

Accepted Answer

TArC is a dataset extracted from social media, totaling 43,313 tokens. The classification task is to label each token with one of three classes: arabizi, foreign, or emotag.
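As a rough illustration of what token-level labeling means here, the sketch below represents a sentence as a list of (token, label) pairs and counts the labels. The pair layout, the helper names `validate` and `label_counts`, and the sample tokens are all assumptions made for illustration. They are not taken from the corpus or its actual file format.

```python
from collections import Counter

# The three token-level classes named in the answer above.
LABELS = {"arabizi", "foreign", "emotag"}


def validate(tagged):
    """Return the pairs unchanged, or raise ValueError on an unknown label."""
    for token, label in tagged:
        if label not in LABELS:
            raise ValueError(f"unknown label {label!r} for token {token!r}")
    return tagged


def label_counts(tagged):
    """Count how many tokens carry each label."""
    return Counter(label for _, label in tagged)


# Invented sample for illustration only, NOT drawn from TArC.
sample = [
    ("3aslema", "arabizi"),
    ("merci", "foreign"),
    (":)", "emotag"),
    ("barcha", "arabizi"),
]

counts = label_counts(validate(sample))
print(counts)
```

The actual corpus may store tokens and labels differently, so check the files in the source repository before adapting this.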

Question 2

Is Tunisian Arabish Corpus (TArC) a benchmark?

Accepted Answer

Yes, Tunisian Arabish Corpus (TArC) is used as an LLM benchmark. See the model leaderboards in the Benchmarks section.

Question 3

Where can I download Tunisian Arabish Corpus (TArC)?

Accepted Answer

Tunisian Arabish Corpus (TArC) is available at its source: https://github.com/eligugliotta/tarc.
