SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection

Organized by dschlechtweg

Welcome!

This is the CodaLab Competition for SemEval 2020 Task 1 addressing the unsupervised detection of lexical semantic change, i.e., word sense changes over time, in text corpora of German, English, Latin and Swedish. The task is organized by Barbara McGillivray, Dominik Schlechtweg, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. The organisers can be reached at

semeval2020lexicalsemanticchange [at] turing dot ac dot uk

The past decade has seen a rise in academic work on the computational detection of lexical semantic change. An overview of the literature up to 2018 is available in Tahmasebi et al. (2018) and Kutuzov et al. (2018). However, most studies use different evaluation procedures and tackle different languages, corpora and periods, which makes systems difficult to compare. This SemEval task aims to introduce a simple evaluation framework for unsupervised lexical semantic change detection.

Tasks

There are two tasks: a classification task and a ranking task. For each language, both tasks are based on the same two corpora, which span different time periods. Corpora will be provided in four languages: German, English, Swedish and Latin. Systems will be evaluated against a ground truth annotated by human native speakers (Schlechtweg et al., 2018); for Latin, the annotation was carried out by scholars of Latin. Find a more detailed description on the 'Tasks' page.

Registration

For this competition we only accept submissions from competition teams (a team may consist of a single member). If your team has several members, please register each member individually for the competition and build a team as described. If you are a single participant, please also build a team with a single member. A participant can only be part of one team; if you want to be part of several teams (e.g. because you want to submit several systems), please write us an email. Each team can submit at most 10 times, and only the final submission counts; a submission by any team member counts as a team submission. Read more about how to create and join a team here.

Phases

For an overview of the phases, please see the 'Phases' page.

Timeline

  • Trial data ready July 31, 2019
  • Test data ready December 3, 2019
  • Evaluation start January 10, 2020
  • Evaluation end January 31, 2020
  • Paper submission due February 23, 2020
  • Notification to authors March 29, 2020
  • Camera ready due April 5, 2020
  • SemEval workshop Summer 2020


Get Started

Get started directly by registering for the competition and downloading the starting kit under 'Participate -> Files -> Starting Kit'.

Tasks

We rely on the comparison of two time periods t1 and t2 for each language. This has two main advantages:

  1. it reduces the number of time periods for which data has to be annotated, so we can annotate larger corpus samples and hence more reliably represent the sense frequency distributions of the target words;
  2. it reduces the task complexity, allowing a wide range of model architectures to be applied and thus widening the range of possible participants.

Participants are asked to solve two tasks, a classification and a ranking task. Participants are required to submit results for all four languages of the task.

Task 1: Binary classification

Given two corpora C1 and C2 (for time periods t1 and t2) and a set of target words, decide which words lost or gained sense(s) between t1 and t2 and which ones did not, as annotated by human judges.

Example: consider the word cell in Fig. 1. The task is to determine that the target word cell changed sense(s) between t1 and t2, as it gained the 'Phone' sense. A word is classified as gaining a sense if that sense is never attested in C1 but attested at least k times in C2, where k is a small number (< 10) that will be announced at the beginning of the evaluation phase. (Similarly for words that lose a sense.) We make no distinction between words that gain senses and words that lose senses; both fall into the change class. Equally, we make no distinction between words that gain or lose one sense and words that gain or lose several senses.


Fig. 1 - Sense frequency distribution of cell in t1 and t2
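
To make the rule above concrete, the following is a minimal sketch in Python over hypothetical sense frequency counts (the counts and the value of k below are purely illustrative; the real annotation is done by human judges, and k is only announced at the start of the evaluation phase):

    # Sketch of the task 1 rule over hypothetical sense frequency counts.
    def binary_change(senses_c1, senses_c2, k):
        """Return 1 (change) if some sense is unattested in one corpus
        but attested at least k times in the other, else 0 (stable)."""
        gained = any(senses_c1.get(s, 0) == 0 and count >= k
                     for s, count in senses_c2.items())
        lost = any(senses_c2.get(s, 0) == 0 and count >= k
                   for s, count in senses_c1.items())
        return int(gained or lost)

    # Hypothetical counts for 'cell' (cf. Fig. 1): the 'Phone' sense is
    # unattested in C1 but attested in C2, so the word falls into class 1.
    c1 = {"Chamber": 120, "Biology": 80}
    c2 = {"Chamber": 40, "Biology": 90, "Phone": 60}
    print(binary_change(c1, c2, k=5))  # -> 1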

Task 2: Ranking

Given two corpora C1 and C2, rank a set of target words according to their degree of lexical semantic change between t1 and t2, as annotated by human judges. A higher rank means stronger change.

Example: consider the word cell in Fig. 1. In task 2, we evaluate models' ability to capture fine-grained changes in the two sense frequency distributions. A word's degree of lexical semantic change is defined as the difference between its normalized sense frequency distributions for t1 and t2 (Fig. 1), measured by the Jensen-Shannon Distance (Lin, 1991; Donoso and Sanchez, 2017). Fig. 1 shows that the frequency of the sense 'Chamber' drops from t1 to t2, although it is not totally lost. Such a change increases the degree of lexical semantic change for task 2, but does not count as change in task 1.
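
As an illustration, the following sketch computes this measure from hypothetical sense frequency counts, using SciPy's Jensen-Shannon distance (the counts below are invented for the 'cell' example):

    # Sketch: degree of change as the Jensen-Shannon distance between the
    # normalized sense frequency distributions of a word in C1 and C2.
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def degree_of_change(senses_c1, senses_c2):
        senses = sorted(set(senses_c1) | set(senses_c2))
        p = np.array([senses_c1.get(s, 0) for s in senses], dtype=float)
        q = np.array([senses_c2.get(s, 0) for s in senses], dtype=float)
        # normalize counts to probability distributions
        return jensenshannon(p / p.sum(), q / q.sum())

    # Hypothetical counts for 'cell' (cf. Fig. 1)
    print(degree_of_change({"Chamber": 120, "Biology": 80},
                           {"Chamber": 40, "Biology": 90, "Phone": 60}))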


Evaluation

Metrics

  • Task 1 (Binary Classification): Participating systems will be evaluated using Accuracy (ACC) against the true binary classification as annotated by humans.
  • Task 2 (Ranking): Participating systems will be evaluated using Spearman's rank-order correlation coefficient (SPR) against the true rank as annotated by humans.
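
As a rough illustration of these two metrics (with invented gold values and predictions; the actual scoring is performed by the CodaLab evaluation script):

    # Sketch of the two metrics on invented gold values and predictions.
    from scipy.stats import spearmanr

    # Task 1: accuracy against the gold binary classes
    gold_classes = {"word1": 1, "word2": 0, "word3": 1}
    pred_classes = {"word1": 1, "word2": 1, "word3": 1}
    acc = sum(pred_classes[w] == gold_classes[w]
              for w in gold_classes) / len(gold_classes)

    # Task 2: Spearman correlation against the gold change scores;
    # spearmanr assigns tied values their average rank automatically.
    gold_scores = {"word1": 0.8, "word2": 0.1, "word3": 0.5}
    pred_scores = {"word1": 0.9, "word2": 0.2, "word3": 0.3}
    words = sorted(gold_scores)
    spr, _ = spearmanr([pred_scores[w] for w in words],
                       [gold_scores[w] for w in words])
    print(acc, spr)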

Submission Format

Participants are required to upload a zipped folder with the following naming and structure (find a sample submission in the trial data under 'Participate'):

  • answer/
    • task1/
      • english.txt
      • german.txt
      • latin.txt
      • swedish.txt
    • task2/
      • english.txt
      • german.txt
      • latin.txt
      • swedish.txt

The four files under task1/ are tab-separated text files assigning predicted classes to target words. Each line contains an assignment in the following format:

word1 [tab] class1
word2 [tab] class2
...

Each file corresponds to one language and must assign one of the two classes, 0 (stable) or 1 (change), to each target word of that language.

The four files under task2/ are similar, but assign predicted change scores to target words. Each line contains an assignment in the following format:

word1 [tab] score1
word2 [tab] score2
...

Each file corresponds to one language and must contain a real-valued prediction for each target word of that language (no NaNs). Higher values mean a higher rank, i.e., the highest value in a file will receive rank 1 (the target word with the highest predicted degree of change). Ties (targets with equal values) are handled automatically by assigning them the average of the tied ranks (see also scipy.stats.rankdata). Each file is then scored by calculating the Spearman rank correlation coefficient between the submitted scores and the true scores for the respective language. Note that for Spearman correlation only the ranking of the targets influences the result, not the exact values. Thus, a submission receives the highest possible score of 1.0 if the predicted values rank the target words exactly as in the true ranking, and the lowest possible score of -1.0 if they rank the target words in exactly the opposite order.

The parent folder answer/ must be zipped before uploading it. Find a sample submission in the trial data under 'Participate'. Make sure that the folder you zip is exactly named 'answer', that all your files are UTF-8 encoded and that there are no additional lines (e.g. after the last line). A frequent issue is the use of text editors whose default text encoding is not UTF-8 (e.g. Notepad on a Windows machine).
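
A minimal sketch of how such a submission could be assembled (the predictions below are placeholders, and a real submission must contain all four languages for both tasks):

    # Sketch: write tab-separated, UTF-8 encoded answer files without a
    # trailing blank line and zip the 'answer' folder. Predictions are
    # placeholders; a real submission covers all four languages.
    import os
    import zipfile

    predictions = {
        "task1": {"english": {"word1": 0, "word2": 1}},        # binary classes
        "task2": {"english": {"word1": 0.12, "word2": 0.87}},  # change scores
    }

    for task, languages in predictions.items():
        for language, values in languages.items():
            path = os.path.join("answer", task, language + ".txt")
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w", encoding="utf-8") as f:
                f.write("\n".join("{}\t{}".format(w, v) for w, v in values.items()))

    with zipfile.ZipFile("answer.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk("answer"):
            for name in files:
                full_path = os.path.join(root, name)
                zf.write(full_path, arcname=full_path)  # keeps the 'answer/...' structure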

Important:

  • Participants are required to submit results for both tasks and all four languages.
  • Participants' systems are required to only use the published evaluation data (target words, corpora) as input. The use of pre-trained embeddings is allowed, but only when trained in a completely unsupervised way, i.e., not on any manually annotated data.
  • The leaderboard will only show the latest submission of each participant. Only the final submission to the leaderboard will count as participants' result for the shared task.

Baselines

There are two model baselines for the shared task:

  1. normalized frequency difference (FD)
  2. count vectors with column intersection and cosine distance (CNT+CI+CD)

FD first calculates the frequency of each target word in each of the two corpora, normalizes it by the total corpus frequency and then takes the absolute difference between these values as a measure of change. CNT+CI+CD first learns vector representations for each of the two corpora, aligns them by intersecting their columns, and measures change as the cosine distance between the two vectors of a target word. Find an implementation of these models under 'Participate -> Files -> Starting Kit'; find more information in Schlechtweg et al. (2019).
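
As an illustration of the first baseline, the sketch below computes FD scores from two corpus files in the format described on the 'Data' page (the file paths are placeholders; the starting kit contains the reference implementations of both baselines):

    # Sketch of the FD baseline: absolute difference of a target's
    # corpus-size-normalized frequencies in the two corpora.
    import gzip
    from collections import Counter

    def count_lemmas(path):
        counts, total = Counter(), 0
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for sentence in f:
                lemmas = sentence.split()
                counts.update(lemmas)
                total += len(lemmas)
        return counts, total

    def fd_scores(targets, corpus1_path, corpus2_path):
        counts1, n1 = count_lemmas(corpus1_path)
        counts2, n2 = count_lemmas(corpus2_path)
        return {t: abs(counts1[t] / n1 - counts2[t] / n2) for t in targets}

    # e.g. fd_scores(["cell"], "corpus1.txt.gz", "corpus2.txt.gz")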

Moreover, task 1 has a trivial baseline: the accuracy of a classifier that always predicts the majority class (≈0.5).


Data

Phase 1: Practice

You can download a starting kit containing trial data and implementations of the baselines for the shared task by clicking on 'Participate -> Files -> Starting Kit'. For each language, the trial data contains:

  • trial target words for which predictions can be submitted in the practice phase (targets/)
  • the true classification of the trial target words for task 1 in the practice phase, i.e., the file against which submissions will be scored in the practice phase (truth/task1/)
  • the true ranking of the trial target words for task 2 in the practice phase (truth/task2/)
  • a sample submission for the trial target words in the above-specified format (answer.zip)
  • two trial corpora from which you may predict change scores for the trial target words (corpora/)

Important: The scores in truth/task1/ and truth/task2/ are not meaningful as they were randomly assigned.

You can start by uploading the zipped answer folder to the system to check the submission and evaluation format. (Find more information on the submission format under 'Learn the Details -> Evaluation'.) You may further use the baseline implementations to develop your model.

Phase 2: Evaluation

The data for the evaluation phase will be published here. It has the same format as the trial data, except that no true classification or ranking and no sample submission are provided. For each language we provide the target words and a corpus pair. Each corpus file will be smaller than 1 GB.

Phase 3: Post-evaluation

In the post-evaluation phase the true classification and ranking for the evaluation data will be provided.

Corpora

In the evaluation phase, we provide two time-specific corpora for each language. Participants are required to predict the lexical semantic change of the target words between these two corpora. Each line contains one sentence and has the form

lemma1 lemma2 lemma3...

Punctuation has been removed, empty and one-word sentences have been discarded, and each remaining token has been replaced by its lemma. Sentences have been randomly shuffled within each corpus. Lemmas were not lower-cased, i.e., for German, Latin and Swedish there are lower-case as well as upper-case lemmas (see also Particularities below). Find gzipped samples of these corpora in the trial data under trial_data_public/corpora/. The trial corpora have the same format as the ones which will be used in the evaluation phase. Note that we will apply additional preprocessing to the evaluation corpora, such as deleting low-frequency words (to be further specified upon release).
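
For example, the usages of a target lemma can be collected from such a corpus file as follows (a sketch; the path below points to a hypothetical file in the trial data):

    # Sketch: collect all sentences containing a given target lemma from a
    # gzipped corpus file (one lemmatized sentence per line).
    import gzip

    def sentences_with_target(path, target):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                lemmas = line.split()
                if target in lemmas:
                    yield lemmas

    # Hypothetical path into the trial data:
    # usages = list(sentences_with_target(
    #     "trial_data_public/corpora/english/corpus1.txt.gz", "cell"))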

Time Periods

The tasks will focus on the following time periods.

             t1            t2
English      1810-1860     1960-2010
German       1810-1860     1946-1990
Latin        -200-0        0-2000
Swedish      1800-1830     1900-1925

In the case of Latin, the dates correspond to centuries and are shown using the following convention: for centuries before Christ, a negative number corresponding to the first year of the century is shown, for example "-100" for the first century BC. For centuries after Christ, a positive number corresponding to the first year of the century is shown, for example "1200" for the 13th century AD.

Particularities

English: We use a cleaned version of the COHA corpus (Davies, 2002). The corpus has been transformed by replacing ten words with the placeholder "@" in every 200 words; we split sentences at these tokens and removed them.

German: We use the DTA corpus (Deutsches Textarchiv, 2017) and a combination of the BZ and ND corpora (Berliner Zeitung, 2018; Neues Deutschland, 2018). BZ and ND contain frequent OCR errors.

Latin: We use the LatinISE corpus (McGillivray & Kilgarriff, 2013). The data were automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the "#" symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma "dico" corresponds to the first homonym in the Lewis-Short dictionary and "dico#2" corresponds to the second homonym, cf. Lewis-Short dictionary.

Swedish: We use the KubHist corpus, digitized by the National Library of Sweden and available through the Språkbanken corpus infrastructure Korp (Borin et al., 2012). The corpus is available under a CC BY (attribution) license. KubHist is described in more detail in Adesam et al. (2019). Each word for which the lemmatizer has found a lemma is replaced with that lemma; where the lemmatizer cannot find a lemma, the word is left as is (i.e., unlemmatized and not lower-cased). KubHist contains very frequent OCR errors, especially in the older data.

Licenses

The corpora used in this task are strongly pre-processed and randomized versions of the original corpora and are made freely available. The original authors retain their respective rights, where applicable. We encourage the participants to familiarise themselves with the license included in the data downloads. The manually annotated datasets will be made available by the organisers of the present SemEval task with a different license still to be determined.


Terms and Conditions

By registering for this competition you agree to use only the evaluation data published by the organizers (as specified on the 'Evaluation' page) as input to the system for which you submit results, and you accept that your team will be excluded from the competition otherwise. You agree not to redistribute the evaluation data except in the manner prescribed by its licence. (See the 'Data' page for details.)

You further agree that, if your team has several members, each of them will register for the competition and build a competition team (as described on the 'Overview' page), and that, if you are a single participant, you will build a team with a single member.

By submitting results to this competition, you consent to the public release of your scores at the SemEval-2020 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatically and manually calculated quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

Practice

Start: July 1, 2019, midnight

Description: Check out the website, register and build a team (see the 'Overview' page), implement your model and upload results for the trial data. The number of uploads is unlimited and the leaderboard is public. Note that after a submission reaches the status 'Finished' you need to refresh the page to see your score (the displayed score is the average score on task 1). You will have to push submissions to the leaderboard to see the full results.

Evaluation

Start: Jan. 10, 2020, midnight

Description: The evaluation data (targets, corpora) will be released by December 3. Apply your model to the corpora to create predictions and upload your final submission. The number of uploads is limited to 10 and the leaderboard is hidden.

Post-Evaluation

Start: Feb. 1, 2020, midnight

Description: Analyze your model predictions. The gold evaluation data is publicly available. The number of uploads is unlimited and the leaderboard is public.

Competition Ends

Never
