CALCS 2018 [MSA-EGY] - Named Entity Recognition on Code-switched Data

Organized by gaguilar

Welcome to the Modern Standard Arabic-Egyptian shared task!


Please cite the shared task paper with the following BibTex:

@inproceedings{calcs2018shtask,
    title = {{Overview of the CALCS 2018 Shared Task: Named Entity Recognition on Code-switched Data}},
    author = {Aguilar, Gustavo and AlGhamdi, Fahad and Soto, Victor and Diab, Mona and Hirschberg, Julia and Solorio, Thamar},
    publisher = {Association for Computational Linguistics},
    month = {July},
    year = {2018},
    address = {Melbourne, Australia},
    booktitle = {Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching}
}

Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the intersentential level (switching between sentences), the intrasentential level (mixing words from multiple languages within the same utterance), and even the morphological level (mixing of morphemes). CS presents serious challenges for language technologies such as parsing, machine translation (MT), automatic speech recognition (ASR), information retrieval (IR), information extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when the input is mixed with another language. Even for problems that are considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed language present.

The Third Workshop on Computational Approaches to Linguistic Code-Switching (CALCS 2018) has prepared a shared task on Named Entity Recognition using Modern Standard Arabic-Egyptian code-switched data from social media. The goal is to allow participants to explore supervised, semi-supervised, and/or unsupervised approaches to predict the entity types in CS data. We believe that this effort will provide more resources to the growing CS research community.

Entity Types

  • Person
  • Location
  • Organization
  • Group
  • Title
  • Product
  • Event
  • Time
  • Other

Registration

Participants in the shared task must register on the official website of the workshop and request access to the CodaLab competition (see the Participate tab). Additionally, participants must submit the output of their systems within a pre-specified time window in order to qualify for evaluation in the shared task. They will also be required to submit a paper describing their system.

For more information, please visit the official website of the workshop.

Evaluation

We will evaluate your output predictions with the F1 metric, the harmonic mean of precision and recall, which is the standard way to evaluate NER tasks. Additionally, we include the surface forms F1 metric introduced at the Workshop on Noisy User-generated Text, W-NUT 2017 (Derczynski et al., 2017).

The leaderboard will show both the standard F1 and the surface form F1. However, the ranking will be determined by the average of the two metrics.
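To make the entity-level metric concrete, here is a minimal Python sketch of exact-match F1 over IOB label sequences. This is only an illustration under our own assumptions, not the official evaluation script, and the function names are invented:

    def extract_entities(labels):
        """Collect (start, end, type) spans from a sequence of IOB labels."""
        spans, start, etype = [], None, None
        for i, label in enumerate(labels):
            if label.startswith("B-"):              # a new entity begins
                if start is not None:
                    spans.append((start, i, etype))
                start, etype = i, label[2:]
            elif label.startswith("I-") and start is not None and label[2:] == etype:
                continue                            # the current entity continues
            else:                                   # O tag, or an I- tag without a matching B-
                if start is not None:
                    spans.append((start, i, etype))
                start, etype = None, None
        if start is not None:
            spans.append((start, len(labels), etype))
        return set(spans)

    def entity_f1(gold_labels, pred_labels):
        """F1: harmonic mean of precision and recall over exact entity spans."""
        gold, pred = extract_entities(gold_labels), extract_entities(pred_labels)
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if tp else 0.0

For example, with gold labels ["B-PER", "I-PER", "O", "B-LOC"] and predictions ["B-PER", "I-PER", "O", "O"], precision is 1.0, recall is 0.5, and entity_f1 returns roughly 0.667.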

Terms and Conditions

By participating in the CALCS 2018 shared task, you agree to the following terms and conditions:

  • Your results will be publicly released in the proceedings of CALCS 2018.
  • You accept that your system will be ranked based on the evaluation metric proposed by the organizers.
  • The organizers reserve the right to disqualify any participant who does not follow the rules or engages in suspicious activity.
  • You will provide a paper describing your system.

Task Details

This is a Named Entity Recognition shared task on social media data that exhibits Modern Standard Arabic-Egyptian code-switching behavior. You will have to predict the correct entity type for each token using the IOB scheme with the following categories:

  • [BI]-PER: Person
  • [BI]-LOC: Location
  • [BI]-ORG: Organization
  • [BI]-GROUP: Group
  • [BI]-TITLE: Title
  • [BI]-PROD: Product
  • [BI]-EVENT: Event
  • [BI]-TIME: Time
  • [BI]-OTHER: Other
  • O: Any other token that is not an NE

Note that B and I mark the Beginning and Inside of an entity for each category; they indicate whether a specific token is the start of an NE or a subsequent token of a multi-word NE. You can find the annotation guidelines used for this data here. A short example follows.
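For illustration, a hypothetical fragment (the tokens below are invented for readability, not drawn from the corpus) would be labeled one token per line as follows:

    Barack    B-PER
    Obama     I-PER
    visited   O
    Cairo     B-LOC
    .         O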

External resources

Participants can use any resources (e.g., pre-trained word embeddings, gazetteers, etc.) that they consider appropriate for the task. For the purposes of the competition, there is no distinction between systems that use external resources and those that do not. However, we highly encourage participants to keep track of the performance impact of adding resources, so that such insights can be included in the paper.

Prediction Format

We provide the test set in the CoNLL format. We expect you to add the label next to each token, using a tab as the delimiter. Additionally, do not change the order of the lines in the test set, since doing so could negatively affect your scores. A sketch of how to produce the file appears below.
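As a sketch of how the output file might be produced, assuming the test set has one token per line with blank lines separating tweets, something like the following preserves the line order. The tag_sentence function stands in for your model and is hypothetical:

    def write_predictions(test_path, out_path, tag_sentence):
        """tag_sentence maps a list of tokens to a list of IOB labels."""
        with open(test_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            tokens = []
            for line in fin:
                token = line.strip()
                if token:
                    tokens.append(token)
                    continue
                for tok, label in zip(tokens, tag_sentence(tokens)):
                    fout.write(f"{tok}\t{label}\n")
                fout.write("\n")          # keep blank separators in place
                tokens = []
            if tokens:                    # flush a trailing sentence, if any
                for tok, label in zip(tokens, tag_sentence(tokens)):
                    fout.write(f"{tok}\t{label}\n")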

Submission

The evaluation script expects your submission file to be named "calcs_msa_egy_preds.conll". Additionally, you will need to compress the file in order to upload it to CodaLab; an example follows.
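For example, one way to compress the file is with Python's standard zipfile module (the archive name below is just an example):

    import zipfile

    # Create a zip archive containing only the required prediction file.
    with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("calcs_msa_egy_preds.conll")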

Finally, you will be able to submit your results as a team. As such, please use a team name that you would like to see in the proceedings of the workshop. To join/create a team, please follow the instructions here.

MSA-EGY

Start: March 23, 2018, midnight

Description: This is the NER Modern Standard Arabic-Egyptian competition.

PERPETUAL BENCHMARK

Start: April 29, 2018, 10:15 p.m.

Description: This is the NER Modern Standard Arabic-Egyptian perpetual benchmark phase.

Competition Ends

April 19, 2018, 11 p.m.
