ADoBo — Automatic Detection of Borrowings

Organized by lea


General Overview

ADoBo is the shared task on automatic detection of borrowings. We propose a shared task on detecting direct, unadapted, emerging borrowings in the Spanish press, i.e. detecting lexical borrowings that appear in the Spanish press and that have recently been imported into the Spanish language (words like running, smartwatch, youtuber or fake news).
 
The task will run from February 2021 to June 2021 and is part of IberLEF 2021, which will take place in September 2021 in Spain.
 

What is lexical borrowing?

Lexical borrowing is the process of importing words from one language into another. It affects all languages and is in fact a productive mechanism of word formation. In recent decades, English in particular has produced numerous lexical borrowings (often called anglicisms) in many European languages, especially in the press. It has been estimated that a reader of French newspapers encounters a new lexical borrowing every 1,000 words, with English borrowings outnumbering all other borrowings combined. In Chilean newspapers, lexical borrowings account for approximately 30% of neologisms, 80% of which are English borrowings. In European Spanish, it has been estimated that anglicisms accounted for around 2% of the vocabulary used in the Spanish newspaper El País in 1991, a figure that is likely to be higher today.

As a result, the presence of English borrowings in Spanish has attracted considerable attention, both in linguistics and among the general public.

Why is lexical borrowing detection interesting?

The task of automatically extracting unadapted lexical borrowings from text is relevant both for lexicographic purposes and for downstream NLP tasks. Borrowing detection has previously been used as a preprocessing step for parsing, text-to-speech synthesis and machine translation.

In recent years, several projects have approached the problem of extracting lexical borrowings in European languages such as German, Italian, French, Finnish and Norwegian, with a particular focus on anglicism extraction. More recently, work on anglicism detection in Spanish has also been done for Argentinian Spanish and European Spanish.


What makes borrowing detection challenging?

The task of extracting emergent lexical borrowings is a more challenging undertaking than it might appear to be at first. To begin with, lexical borrowings can be either single or multitoken expressions (e.g., prime time, tie break or machine learning). Second, plain dictionary lookup can be an unreliable source for borrowing detection: after all, a term like social media is a borrowing in Spanish, even when the two tokens that form the term happen to be Spanish words that appear in Spanish dictionaries.

Finally, linguistic adaptation is a diachronic process and, as a result, what constitutes an unadapted borrowing is not clear-cut. For example, words like bar or club were unadapted lexical borrowings in Spanish at some point in the past, but they have been around for so long that the process of phonological and morphological adaptation is now complete and they can no longer be considered unadapted borrowings. On the other hand, realia words, that is, culture-specific elements whose names entered via the language of origin decades ago (like jazz or whisky), cannot be considered emergent anymore, even though their orthography has not been adapted to the Spanish spelling system.

ADoBo shared task

The shared task consists of detecting direct, unadapted, emerging borrowings in the Spanish press, i.e. lexical borrowings that have recently been imported into the Spanish language (words like running, smartwatch, influencer or youtuber).

A corpus of Spanish newswire will be distributed among participants. The articles will be annotated with direct, unadapted, emerging lexical borrowings, i.e. lexical borrowings that have been imported into the Spanish language and have not yet been assimilated (words such as look, hype, cliffhanger or lawfare). Borrowings will be annotated with BIO labels and two possible categories: ENG for English borrowings and OTHER for lexical borrowings from other languages (tokens that are not part of a borrowing will carry the tag O).
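As an illustration of the annotation scheme, a tokenized Spanish sentence containing a multitoken English borrowing would be tagged as follows (the sentence is a made-up example, not taken from the corpus):

```python
# Hypothetical example showing BIO tags: B- marks the first token of a
# borrowing, I- its continuation, and O marks tokens that are not part
# of any borrowing.
tokens = ["Compartió", "una", "fake",  "news",  "en", "su", "perfil"]
tags   = ["O",         "O",   "B-ENG", "I-ENG", "O",  "O",  "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that the two-token borrowing fake news forms a single span: B-ENG on its first token and I-ENG on the second.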

Only unadapted lexical borrowings will be considered. This means that borrowings that have already gone through orthographic or morphological adaptation (such as fútbol or hackear) will not be labeled as borrowings.

Participants will be provided with annotated versions of the training and development set, and an unannotated test set. Participants are expected to submit the annotated test set produced by their system. 

We have established certain limitations on which resources may be used for the shared task: no additional annotated data may be used for training, and no automatically compiled lexicons of borrowings (such as those produced by existing models that perform borrowing extraction) may be used. Any other lexicons, text corpora, word embeddings or contextual embedding models such as BERT are allowed.

Evaluation

Submissions will be evaluated using the standard precision, recall and F1 over spans:

  • Precision: The percentage of borrowings in the system’s output that are correctly recognized and classified.
  • Recall: The percentage of borrowings in the test set that were correctly recognized and classified.
  • F1-measure: The harmonic mean of Precision and Recall.

F1-measure will be used as the official evaluation score, and will be used for the final ranking of the participating teams.

It should be noted that the evaluation for the final ranking will be done exclusively at the span level. This means that only full matches will be considered, and no credit will be given to partial matches, i.e. given the multitoken borrowing late night, the entire phrase would have to be correctly labeled in order to count as a true positive.
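A minimal sketch of this span-level scoring scheme (written for illustration; it is not the official evaluation script) could look like:

```python
def extract_spans(tags):
    """Collect (start, end, label) spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):            # a new span opens here
            if start is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and label == tag[2:]:
            continue                        # current span continues
        else:                               # O tag (or stray I-): close any open span
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:
        spans.append((start, len(tags), label))
    return set(spans)

def span_f1(gold_tags, pred_tags):
    """Precision, recall and F1 over exact span matches (boundaries + label)."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because only exact matches score, a system that tags just late in late night gets no credit: the partially overlapping span counts as both a false positive and a false negative.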

Results

Rank  Team         Tag    Precision  Recall  F1
1     marrouviere  ALL    88.81      81.56   85.03
                   ENG    90.70      82.65   86.49
                   OTHER  47.06      52.17   49.48
2     marrouviere  ALL    89.40      66.30   76.14
                   ENG    90.98      67.55   77.54
                   OTHER  45.45      32.61   37.97
3     marrouviere  ALL    92.28      61.40   73.75
                   ENG    93.43      63.12   75.34
                   OTHER  38.89      15.22   21.88
4     versae       ALL    62.76      46.30   53.29
                   ENG    62.97      47.62   54.23
                   OTHER  45.45      10.87   17.54
5     mgrafu       ALL    65.15      37.82   47.86
                   ENG    65.31      38.90   48.76
                   OTHER  50.00       8.69   14.81
6     Neakail      ALL    75.27      27.47   40.25
                   ENG    75.43      28.25   41.10
                   OTHER  60.00       6.52   11.76
7     Neakail      ALL    76.29      25.29   37.99
                   ENG    76.48      25.99   38.80
                   OTHER  60.00       6.52   11.76
8     Neakail      ALL    76.44      24.75   37.39
                   ENG    76.64      25.42   38.18
                   OTHER  60.00       6.52   11.76


How to participate?

The ADoBo shared task is aimed at participants working on code-mixed data and those interested in the intersection of neology, lexicography and NLP.

Participants (and anyone interested in the topic of borrowing detection) are welcome to join the ADoBo Google group and post there any questions about the task.

 

 Organization Committee

 

Contact

Schedule

* February 15: Sample set, evaluation script and annotation guidelines released.

* March 15: Training set released.

* April 5: Development set released.

* May 10: Test set released.

* May 17: System output submissions.

* May 21: Results posted and test set with gold-standard annotations released.

* May 31: Working notes paper submission.

* June 14: Notification of acceptance (peer reviews).

* June 28: Camera-ready paper submission.

* September: IberLEF 2021 Workshop.

Terms

TBA

Evaluation Phase

Start: April 26, 2021, noon UTC

Description: This is the first and only phase of the competition. Participants have to submit prediction files in the same format as the data provided for training and development, in a file named "results.txt" inside a ZIP file. IMPORTANT: Check that the number of lines in your submission matches the test set, and ignore any Fail status shown when submitting your file. We will check submissions and contact you if we find any problem.
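As a sketch, packaging a prediction file for upload could be done as follows (the file contents below are placeholders, not real predictions):

```python
import zipfile

# Placeholder prediction file: one token and its BIO tag per line, in the
# same format as the training and development data.
with open("results.txt", "w", encoding="utf-8") as f:
    f.write("Compartió O\nuna O\nfake B-ENG\nnews I-ENG\n")

# The platform expects a ZIP file containing a file named results.txt.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("results.txt")
```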

Competition Ends

May 18, 2021, 10 a.m. UTC
