Arabic has a wide variety of dialects, many of which remain under-studied primarily due to a lack of data. The goal of the Nuanced Arabic Dialect Identification (NADI) shared task is to alleviate this bottleneck by providing the community with diverse data from 21 Arab countries. The data can be used for modeling dialects, and NADI focuses on dialect identification. Dialect identification is the task of automatically detecting the source variety of a given text or speech segment. Previous work on Arabic dialect identification has focused on coarse-grained regional varieties such as Gulf or Levantine (e.g., Zaidan and Callison-Burch, 2013; Elfardy and Diab, 2013; Elaraby and Abdul-Mageed, 2018) or country-level varieties (e.g., Bouamor et al., 2018; Zhang and Abdul-Mageed, 2019), such as the MADAR shared task in WANLP 2019 (Bouamor, Hassan, and Habash, 2019). The MADAR shared task also involved city-level classification on human-translated data. Abdul-Mageed, Zhang, Elmadany, and Ungar (2020) also developed models for detecting city-level variation. NADI aims at maintaining this theme of modeling fine-grained variation.
NADI targets province-level dialects, and as such is the first shared task to focus on naturally occurring fine-grained dialects at the sub-country level. The NADI 2020 shared task was held with WANLP 2020 (Abdul-Mageed, Zhang, Bouamor, and Habash, 2020). The NADI 2021 shared task will be held with WANLP@EACL2021 and will continue to focus on fine-grained dialects, with new datasets and efforts to distinguish both modern standard Arabic (MSA) and dialectal Arabic (DA) according to their geographical origin. The data covers a total of 100 provinces from all 21 Arab countries and comes from the Twitter domain. Evaluation and task setup follow the NADI 2020 shared task.
(To receive access to the data, teams intending to participate are invited to fill in the form on the official website of the NADI shared task.)
The subtasks include:
Participants will also be provided with an additional 10M unlabeled tweets that can be used in developing their systems for either or both of the tasks.
The evaluation metrics will include precision, recall, F1-score, and accuracy. The macro-averaged F1-score will be the official metric.
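Macro-averaged F1 gives every class equal weight regardless of how many examples it has, which matters for dialect labels with skewed distributions. A minimal sketch of the computation (not the official scorer; the two-letter labels are purely illustrative):

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: compute per-class F1, then average with equal class weight."""
    classes = set(gold) | set(pred)
    f1_scores = []
    for c in classes:
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Illustrative labels, not actual NADI classes:
score = macro_f1(["EG", "EG", "SA", "SA"], ["EG", "SA", "SA", "SA"])
```

Note that a rare class and a frequent class contribute equally to the final score, unlike accuracy or micro-averaged F1.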
Participating teams will be provided with a common training dataset and a common development dataset. No external manually labelled datasets are allowed. A blind test dataset will be used to evaluate the output of the participating teams. All teams are required to report results on both the development and test sets in their write-ups.
Please visit the official website of the NADI shared task for more information.
For any questions related to this task, please contact the organizers directly using the following email address: firstname.lastname@example.org
Copyright (c) 2021 The University of British Columbia, Canada; Carnegie Mellon University Qatar; New York University Abu Dhabi. All rights reserved.
Start: Dec. 15, 2020, noon
Description: Development phase: Develop your models and submit prediction labels on the DEV set of Subtask 1. Note: Name your submission 'teamname_subtask11_dev_numberOFsubmission.zip'; the zip file should contain a text file of your predictions (e.g., the submission 'UBC_subtask11_dev_1.zip' is the zip file of a team's first prediction file, 'UBC_subtask11_dev_1.txt').
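The naming convention above can be followed with a short packaging script. A minimal sketch in Python, where the team name "UBC" and the one-label-per-line prediction format are illustrative assumptions (check the official guidelines for the exact required format):

```python
import zipfile

# Hypothetical team name and first dev submission for Subtask 1.
team, phase, run = "UBC", "dev", 1
txt_name = f"{team}_subtask11_{phase}_{run}.txt"
zip_name = f"{team}_subtask11_{phase}_{run}.zip"

# Illustrative predictions, one label per line (format assumed, not official).
predictions = ["Egypt", "Saudi_Arabia", "Morocco"]
with open(txt_name, "w", encoding="utf-8") as f:
    f.write("\n".join(predictions) + "\n")

# The archive contains exactly one text file whose name matches the zip's.
with zipfile.ZipFile(zip_name, "w") as zf:
    zf.write(txt_name)
```

The text file's base name matches the zip's, so a mismatch between the two is easy to catch before uploading.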
Start: Dec. 27, 2020, noon
Description: Test phase: Submit your prediction labels on the TEST set of Subtask 1. Each team is allowed a maximum of 3 submissions. Note: Name your submission 'teamname_subtask11_test_numberOFsubmission.zip'; the zip file should contain a text file of your predictions (e.g., the submission 'UBC_subtask11_test_1.zip' is the zip file of a team's prediction file, 'UBC_subtask11_test_1.txt').
Start: Jan. 29, 2021, 11:59 a.m.
Description: Post-Evaluation: Submit your predictions on the TEST set of Subtask 1 after the competition deadline. Name your submission 'teamname_subtask11_test_numberOFsubmission.zip'; the zip file should contain a text file of your predictions (e.g., the submission 'UBC_subtask11_test_1.zip' is the zip file of a team's prediction file, 'UBC_subtask11_test_1.txt').