This is the CodaLab Competition for SemEval 2020 Task 1 addressing the unsupervised detection of lexical semantic change, i.e., word sense changes over time, in text corpora of German, English, Latin and Swedish. The task is organized by Barbara McGillivray, Dominik Schlechtweg, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. The organisers can be reached at semeval2020lexicalsemanticchange [at] turing dot ac dot uk.
Please also register for the shared task Google group for discussions and further information:
The past decade has seen a rise in academic work on computational approaches to lexical semantic change. An overview of the literature until 2018 is available in Tahmasebi et al. (2018) and Kutuzov et al. (2018). Nonetheless, most studies use different evaluation procedures and tackle different languages, corpora, and periods, making systems difficult to compare. This SemEval task aims to introduce a simple evaluation framework for unsupervised lexical semantic change detection.
There are two tasks: a classification task, and a ranking task. For each language, both tasks are based on the same two corpora, which span different periods. Corpora will be provided in four languages: German, English, Swedish, and Latin. Systems will be evaluated against a ground truth, as annotated (Schlechtweg et al., 2018) by human native speakers (except for Latin, which was annotated by scholars of Latin). Find a more detailed description on the 'Tasks' page.
For this competition, we allow only competition teams. (A team may consist of a single member.) If your team has several members, please register each member individually for the competition and build a team as described. If you are a single participant, please also build a team with a single member. A participant can be part of only one team; if you want to be part of several teams (e.g. because you want to submit several systems), please write us an email. A team can submit at most 10 times, and only the final submission counts. A submission by any team member counts as a team submission. Read more about how to create and join a team here.
For an overview of the phases, please see the 'Phases' page.
Get started directly by registering for the competition and downloading the starting kit under 'Participate -> Files -> Starting Kit'.
We rely on the comparison of two time periods t1 and t2 for each language. This has two main advantages:
Participants are asked to solve two tasks, a classification and a ranking task (cf. Schlechtweg and Schulte im Walde, 2020). Participants are required to submit results for all four languages of the task.
Given two corpora C1 and C2 (for time periods t1 and t2), for a set of target words, decide which words lost or gained senses between t1 and t2, and which ones did not; as annotated by human judges.
Example: consider the word cell in Fig. 1. The task is to determine that the target word cell changed sense(s) between t1 and t2, as it gained the 'Phone' sense. A word is classified as gaining a sense if the sense is attested at most k times in the annotation sample from C1, but at least n times in the sample from C2. (Similarly for words that lose a sense.) We set k=0, n=1 for the smaller samples (≤30) in Latin and k=2, n=5 for the larger samples (≤100) in English, German, Swedish. We make no distinction between words that gain and words that lose senses; both fall into the change class. Equally, we make no distinction between words that gain/lose one sense and words that gain/lose several senses.
Fig. 1 - Sense frequency distribution of cell in t1 and t2
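The decision rule for task 1 can be sketched in a few lines of Python. This is only an illustration, not the organisers' annotation code: the function name, the dictionary layout and the sense counts are invented, and the rule for losing a sense is assumed to be the exact mirror image of the rule for gaining one.

```python
from typing import Dict


def sense_changed(counts_t1: Dict[str, int], counts_t2: Dict[str, int],
                  k: int, n: int) -> int:
    """Return 1 (change) if any sense is gained or lost, else 0 (stable).

    counts_t1 / counts_t2 map a sense label to the number of times it is
    attested in the annotation sample from C1 / C2.
    """
    for sense in set(counts_t1) | set(counts_t2):
        c1 = counts_t1.get(sense, 0)
        c2 = counts_t2.get(sense, 0)
        gained = c1 <= k and c2 >= n
        # Assumption: "similarly for words that lose a sense" means the
        # same thresholds with the roles of C1 and C2 swapped.
        lost = c2 <= k and c1 >= n
        if gained or lost:
            return 1
    return 0


# Invented counts, loosely following the 'cell' example.
cell_t1 = {"Chamber": 40, "Biology": 30, "Phone": 0}
cell_t2 = {"Chamber": 10, "Biology": 45, "Phone": 35}
print(sense_changed(cell_t1, cell_t2, k=2, n=5))  # 1: 'Phone' is gained
```

Note that the drop of 'Chamber' from 40 to 10 does not trigger the rule on its own, because the sense is still attested more than k times in the second sample.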
Given two corpora C1 and C2, rank a set of target words according to their degree of lexical semantic change between t1 and t2, as annotated by human judges. A higher rank means stronger change.
Example: consider the word cell in Fig. 1. In task 2, we evaluate models' ability to capture fine-grained changes in the two sense frequency distributions. A word's degree of lexical semantic change is defined as the difference between its normalized sense frequency distributions for t1 and t2 (Fig. 1), measured by the Jensen-Shannon Distance (Lin, 1991; Donoso and Sanchez, 2017). Fig. 1 shows that the frequency of the sense 'Chamber' drops from t1 to t2, although it is not totally lost. Such a change increases the degree of lexical semantic change for task 2, but may not count as change in task 1 (depending on the thresholds k and n).
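As a rough illustration of the task 2 measure, the Jensen-Shannon Distance between two sense frequency distributions can be computed as below. The sketch assumes logarithms to base 2 (so that the distance lies between 0 and 1) and takes the distance to be the square root of the Jensen-Shannon divergence; the source does not state these details, and the sense counts are invented.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn raw sense counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(counts_t1: Sequence[float],
                            counts_t2: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence (base 2 is an assumption)."""
    p = normalize(counts_t1)
    q = normalize(counts_t2)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negatives


# Invented counts for the senses (Chamber, Biology, Phone) in t1 and t2.
print(jensen_shannon_distance([40, 30, 0], [10, 45, 35]))
```

Both count lists must enumerate the senses in the same order. Identical distributions give 0, and distributions with no shared sense give 1 under the base-2 assumption.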
Participants are required to upload a zipped folder with the following naming and structure (find a sample submission in the trial data under 'Participate'):
The four files under task1/ are tab-separated text files assigning predicted classes to target words. Each line contains an assignment in the following format:
word1 [tab] class1
word2 [tab] class2
Each file corresponds to one language and must assign either of the two classes 0 (stable) or 1 (change) to each target word of that language.
The four files under task2/ are similar, but assign predicted change scores to target words. Each line contains an assignment in the following format:
word1 [tab] score1
word2 [tab] score2
Each file corresponds to one language and must contain real-valued predictions for each target word for that language (no NaNs). High values mean a high rank, i.e., the highest value in a file will receive rank 1 (the target from the file with the highest degree of change). Ties (targets for which values are equal) are handled automatically by assigning to them the average rank of all tied targets. (See also scipy.stats.rankdata.) Each file is then scored by calculating the Spearman rank correlation coefficient between the submitted scores and the true scores for the respective language. Note that for Spearman correlation only the ranking of the targets influences the results, not the exact values. Thus, a submission will receive the highest possible score of 1.0 if the predicted values rank the target words exactly as in the true ranking. Conversely, it will receive the lowest possible score of -1.0 if the predicted values rank the target words in exactly the opposite order.
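The scoring described above can be reproduced approximately with the standard library alone. The following is a minimal sketch, not the official scorer: it assigns average ranks to ties (the behaviour the text attributes to scipy.stats.rankdata) and then takes the Pearson correlation of the two rank lists, which is one standard way of computing Spearman's coefficient. All function names are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values so the highest gets rank 1; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # positions are 0-based
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")
    return cov / (norm_x * norm_y)


def spearman(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Spearman correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(predicted), average_ranks(gold))


print(average_ranks([5.0, 5.0, 1.0]))  # [1.5, 1.5, 3.0]
```

Because only ranks enter the computation, rescaling the predicted scores by any strictly increasing function leaves the result unchanged.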
The parent folder answer/ must be zipped before uploading it. Find a sample submission in the trial data under 'Participate'. Make sure that the folder you zip is exactly named 'answer', that all your files are UTF-8 encoded and that there are no additional lines (e.g. after the last line). A frequent issue is the use of text editors whose default text encoding is not UTF-8 (e.g. Notepad on a Windows machine).
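Before zipping, it can help to check each prediction file locally. The sketch below is a hypothetical pre-upload check, not part of the official tooling: it verifies UTF-8 decoding, the two-column tab-separated layout, the allowed values per task, and the absence of empty lines. It assumes that a single newline terminating the last line is acceptable and that any blank line is not; the labels "task1" and "task2" are just arguments of this helper.

```python
import math
from typing import List


def check_submission(raw: bytes, task: str) -> List[str]:
    """Return a list of human-readable problems found in one prediction file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # assumption: one terminating newline is fine

    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r")  # tolerate Windows line endings
        if not line.strip():
            problems.append(f"line {number}: empty line")
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(f"line {number}: expected 2 tab-separated fields")
            continue
        value = fields[1]
        if task == "task1":
            if value not in ("0", "1"):
                problems.append(f"line {number}: class must be 0 or 1")
        else:
            try:
                score = float(value)
            except ValueError:
                problems.append(f"line {number}: score is not a number")
                continue
            if math.isnan(score):
                problems.append(f"line {number}: score is NaN")
    return problems


print(check_submission("Gott\t1\nNatur\t0\n".encode("utf-8"), "task1"))  # []
```

Reading the file in binary mode (open(path, "rb").read()) and passing the bytes to this function keeps the encoding check independent of the platform's default encoding.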
There are two model baselines for the shared task:
FD first calculates the frequency for each target word in each of the two corpora, normalizes it by the total corpus frequency and then calculates the absolute difference in these values as a measure of change. CNT+CI+CD first learns vector representations for each of the two corpora, then aligns them by intersecting their columns and measures change by cosine distance between the two vectors for a target word. Find an implementation of these models under 'Participate -> Files -> Starting Kit'; find more information in Schlechtweg et al. (2019).
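To make the two baselines concrete, here is a toy re-implementation in plain Python. It is a simplified sketch and not the starting-kit code: corpora are lists of lemma lists, the count vectors use a symmetric context window of 2 (an arbitrary choice), and "intersecting columns" is approximated by keeping only context words that occur in both corpora. All names are illustrative.

```python
import math
from collections import Counter
from typing import List, Set

Corpus = List[List[str]]  # one list of lemmas per sentence


def frequency_difference(corpus1: Corpus, corpus2: Corpus, target: str) -> float:
    """FD: absolute difference of the target's relative frequencies."""
    rel1 = sum(s.count(target) for s in corpus1) / sum(len(s) for s in corpus1)
    rel2 = sum(s.count(target) for s in corpus2) / sum(len(s) for s in corpus2)
    return abs(rel1 - rel2)


def cooccurrence_vector(corpus: Corpus, target: str, window: int = 2) -> Counter:
    """Count context words within `window` positions of each target token."""
    vector = Counter()
    for sentence in corpus:
        for i, lemma in enumerate(sentence):
            if lemma != target:
                continue
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vector[sentence[j]] += 1
    return vector


def vocabulary(corpus: Corpus) -> Set[str]:
    return {lemma for sentence in corpus for lemma in sentence}


def cosine_distance_ci(corpus1: Corpus, corpus2: Corpus, target: str,
                       window: int = 2) -> float:
    """CNT+CI+CD sketch: count vectors, shared columns, cosine distance."""
    shared = sorted(vocabulary(corpus1) & vocabulary(corpus2))
    v1 = cooccurrence_vector(corpus1, target, window)
    v2 = cooccurrence_vector(corpus2, target, window)
    a = [v1[w] for w in shared]  # Counter returns 0 for unseen words
    b = [v2[w] for w in shared]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return float("nan")  # target has no shared contexts
    return 1 - sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


old = [["the", "cell", "wall"], ["a", "cell", "door"]]
new = [["the", "cell", "phone"], ["my", "cell", "phone", "rang"]]
print(frequency_difference(old, new, "cell"))  # |2/6 - 2/7|
```

On this toy pair the only shared context word of cell is "the", so the cosine distance is 0 even though the usage clearly differs; with realistic corpora the intersected vocabulary is large enough for the distance to be informative.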
Moreover, task 1 has a random baseline: the accuracy of a classifier always predicting the majority class (≈0.5).
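The baseline described here can be computed as follows. This is a small sketch under the assumption, suggested by the sentence above, that task 1 is scored by accuracy; the gold labels are invented and the function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of target words whose predicted class matches the gold class."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def majority_class_accuracy(gold: Sequence[int]) -> float:
    """Accuracy of a classifier that always predicts the most frequent class."""
    majority_label, _ = Counter(gold).most_common(1)[0]
    return accuracy([majority_label] * len(gold), gold)


gold_labels = [0, 1, 1, 0, 1]  # invented labels: 0 = stable, 1 = change
print(majority_class_accuracy(gold_labels))  # 0.6
```

With roughly balanced classes, as the value of about 0.5 suggests, this baseline is close to chance level.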
You can download a starting kit containing trial data and implementations of the baselines for the shared task by clicking on 'Participate -> Files -> Starting Kit'. For each language, the trial data contains:
Important: The scores in truth/task2/ are not meaningful as they were randomly assigned.
You can start by uploading the zipped answer folder to the system to check the submission and evaluation format. (Find more information on the submission format under 'Learn the Details -> Evaluation'.) You may further use the baseline implementations to develop your model.
We provide an updated starting kit containing a script that automatically downloads the test data and runs the baselines. The test data can be manually downloaded here:
It has the same format as the trial data with the difference that no true classification and ranking is provided. For each language we provide the target words and a corpus pair.
In the starting kit we also provide a sample submission for the test target words with randomly assigned scores (
In the post-evaluation phase the true classification and ranking for the evaluation data will be provided.
We provide two time-specific corpora for each language. Participants are required to predict the lexical semantic change of the target words between these two corpora. Each line contains one sentence and has the form
lemma1 lemma2 lemma3...
Punctuation, empty and one-word sentences have been removed and each token has been replaced by its lemma. Sentences have been randomly shuffled within each corpus. Lemmas were not lower-cased, i.e., for German, Latin and Swedish there are lower- as well as upper-case lemmas (see also Particularities below). Find gzipped samples of these corpora in the trial data under trial_data_public/corpora/. The trial corpora have the same format as the ones used in the evaluation phase.
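A corpus in this format can be streamed sentence by sentence without unpacking it first. The snippet below is a minimal sketch: the helper names are illustrative, and the demo writes its own tiny gzipped file to a temporary directory instead of assuming any particular file name inside the trial data. Matching is case-sensitive because the lemmas were not lower-cased.

```python
import gzip
import os
import tempfile
from collections import Counter
from typing import Iterable, Iterator, List


def read_corpus(path: str) -> Iterator[List[str]]:
    """Yield one sentence at a time as a list of space-separated lemmas."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            lemmas = line.split()
            if lemmas:
                yield lemmas


def target_frequencies(path: str, targets: Iterable[str]) -> Counter:
    """Count how often each target lemma occurs (case-sensitive)."""
    wanted = set(targets)
    counts = Counter()
    for sentence in read_corpus(path):
        counts.update(lemma for lemma in sentence if lemma in wanted)
    return counts


# Demo on an invented two-sentence corpus.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("der Gott sein groß\ndie Natur sein schön\n")
    demo_counts = target_frequencies(demo_path, ["Gott", "sein"])

print(demo_counts["sein"], demo_counts["Gott"])  # 2 1
```

Streaming keeps memory use roughly constant, which matters for the full evaluation corpora.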
The tasks focus on the following time periods.
English: We use CCOHA (Alatrash et al., 2020), a cleaned version of the COHA corpus (Davies, 2002). The corpus has been transformed by replacing ten words with "@" every 200 words. We split sentences at these tokens and removed them.
German: We use the DTA corpus (Deutsches Textarchiv, 2017) and a combination of the BZ and ND corpora (Berliner Zeitung, 2018; Neues Deutschland, 2018). BZ and ND contain frequent OCR errors.
Latin: We use the LatinISE corpus (McGillivray & Kilgarriff, 2013). The data were automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the "#" symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma "dico" corresponds to the first homonym in the Lewis-Short dictionary and "dico#2" corresponds to the second homonym, cf. Lewis-Short dictionary.
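When processing the Latin corpus, it may be convenient to separate the homonym number from the lemma. The helper below is a hypothetical sketch of that convention: it treats a lemma without "#" as homonym 1, following the "dico" example above, and its name is not taken from any provided tooling.

```python
from typing import Tuple


def split_homonym(lemma: str) -> Tuple[str, int]:
    """Split 'dico#2' into ('dico', 2); a lemma without '#' maps to number 1."""
    base, separator, number = lemma.rpartition("#")
    if separator and base and number.isdigit():
        return base, int(number)
    return lemma, 1


print(split_homonym("dico"))    # ('dico', 1)
print(split_homonym("dico#2"))  # ('dico', 2)
```

Target words should still be matched against the corpus in their full form (including "#2"), since that is how the lemmas appear in the text.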
Swedish: We use the KubHist corpus, digitized by the National Library of Sweden, and available through the Språkbanken corpus infrastructure Korp (Borin et al., 2012). The corpus is available under a CC BY (attribution) license. KubHist is described in greater detail in Adesam et al. (2019). Each word for which the lemmatizer has found a lemma is replaced with the lemma. In cases where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized, no lower-casing). KubHist contains very frequent OCR errors, especially for the older data.
The corpora used in this task are strongly pre-processed and randomized versions of the original corpora and are made freely available. The original authors retain their respective rights, where applicable. We encourage the participants to familiarise themselves with the license included in the data downloads. The manually annotated datasets will be made available by the organisers of the present SemEval task with a different license still to be determined.
By registering for this competition, you agree to use only the evaluation data published by the organizers (as specified on the 'Evaluation' page) as input to the system for which you submit results, and you accept that your team will be excluded from the competition if you do otherwise. You agree not to redistribute the evaluation data except in the manner prescribed by its license. (See 'Data' page for details.)
You further agree that if your team has several members, each of them will register to the competition and build a competition team (as described on the 'Overview' page) and that if you are a single participant you will build a team with a single member.
By submitting results to this competition, you consent to the public release of your scores at the SemEval-2020 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatically and manually calculated quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
Start: July 1, 2019, midnight
Description: Check out the website, register and build a team (see 'Overview' page), implement your model and upload results for trial data. The number of uploads is unlimited and the leaderboard is public. Note that after a submission reaches the status 'Finished' you need to refresh the page to see your score (score = average score on task 1). You will have to push submissions to the leaderboard to see full results.
Start: Feb. 19, 2020, 10 p.m.
Description: The evaluation data (targets, corpora) is released by February 19. Apply your model to the corpora to create predictions and upload your final submission. The number of uploads is limited to 10 and the leaderboard is hidden. Only the final valid submission will be taken as the official submission to the competition.
Start: March 12, 2020, midnight
Description: Analyze your model predictions. Gold evaluation data is publicly available. The number of uploads is unlimited and the leaderboard is public.