SemEval 2020 Task 3 - Predicting the (Graded) Effect of Context in Word Similarity

Organized by csantosarmendariz


Evaluation Subtask1
Feb. 19, 2020, 3 a.m. UTC


Post-Evaluation Subtask2
March 12, 2020, midnight UTC


Post-Evaluation Subtask1
March 12, 2020, midnight UTC


For this task we ask participants to build systems that predict the effect that context has on the human perception of word similarity.

We have seen very interesting work that uses local context to predict discrete changes in meaning: the different senses of a polysemous word. However, context also has more subtle, continuous (graded) effects on meaning, even for words not necessarily considered polysemous.


To study these effects, we are building several datasets in which annotators score how similar a pair of words is after reading a short paragraph that contains both words. Each pair is scored within two such paragraphs, allowing us to look at changes in similarity ratings due to context.

Let's see some examples from the instructions we give to annotators.

Example 1 - Room and Cell:

The meaning of words can be affected by the sentences and contexts in which we find them. In some cases, this is because different contexts make it clear that completely different senses of a word are being intended.

In Sentence 1 below, the words room and cell both refer to different kinds of room in a building:

Sentence 1: Her prison cell was almost an improvement over her room at the last hostel.

However, in Sentence 2 below, the words are being used in different senses ('room' as an abstract concept and 'cell' as a biological term):

Sentence 2: His job as a biologist didn't leave much room for a personal life. He knew much more about human cells than about human feelings.

We would expect most people to agree that room and cell have much more similar meanings in Sentence 1 than they do in Sentence 2.

Example 2 - Population and People:

However, context can affect meaning in more subtle ways too, making us think of concepts in slightly different ways even if they have the same overall sense.
For example, in Sentence 1 below, the words 'population' and 'people' seem quite closely related, because we know that the population is made of people:

Sentence 1: The population of India is actually bigger than most people expect.

In contrast, in Sentence 2, the same words seem less similar, because this time we are talking about a population of bison - a group of animals rather than a group of people:

Sentence 2: The population of bison became a lot smaller when people settled in the valley.

Again, we would expect most people to agree that population and people have more similar meanings in Sentence 1 than in Sentence 2.

In the following tasks, we will ask you to select sentences which make the meanings of two words seem more and less similar.

Example 3 - From the pilot:

As we can see, even if these words are not particularly polysemous, there is a significant difference in the average score of similarity after reading each of the contexts. This could be related to the fact that the first context refers to "gazelle population".

We are building datasets containing these contextual similarity ratings in four (possibly five) different languages:

  • Croatian: HR
  • English: EN
  • Finnish: FI
  • Slovenian: SL
  • Estonian may be added to the list.

The pairs of words come from the well-known SimLex999 dataset. The contexts will be chosen so as to encourage different perceptions of similarity. Polysemy will play a role; however, we are especially interested in more subtle, graded changes in meaning.

You can read more details here:


The development of this task and the associated dataset is supported by the European Union Horizon 2020 research and innovation programme under Grant No. 825153, EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).


There are two different subtasks:

  1. Subtask 1: Predicting the change in the human annotators' scores of similarity when presented with the same pair of words within two different contexts. This task directly addresses our main question. It evaluates how well systems are able to model the effect that context has on the human perception of similarity.

  2. Subtask 2: Predicting the human scores of similarity for a pair of words within two different contexts. This is a more traditional task which evaluates systems' ability to model both similarity of words and the effect that context has on it.

Both are unsupervised tasks: we won't be releasing training data. Both use the same input data (pairs of words and contexts), but each of them has its own phases and leaderboards. This means the submissions are independent and you can use different models for each of the subtasks.
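As a rough illustration of how the two subtasks relate, the sketch below assumes that a system already has one vector per target word in each context (for example, from a contextual embedding model). The function names and the toy vectors are hypothetical, and cosine similarity is only one possible choice of similarity function; it is not the organizers' method. The only part taken from the task description is that the Subtask 1 value is the difference sim_context2 - sim_context1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_pair(w1_c1, w2_c1, w1_c2, w2_c2):
    """Return (sim_context1, sim_context2, change) for one word pair.

    Each argument is a hypothetical vector for one target word
    in one of the two contexts.
    """
    sim1 = cosine(w1_c1, w2_c1)  # Subtask 2, first value
    sim2 = cosine(w1_c2, w2_c2)  # Subtask 2, second value
    change = sim2 - sim1         # Subtask 1 value
    return sim1, sim2, change


# Toy vectors (made up): identical in context 1, orthogonal in context 2.
sim1, sim2, change = predict_pair([1.0, 0.0], [1.0, 0.0],
                                  [1.0, 0.0], [0.0, 1.0])
```

Since the two subtasks are scored independently, nothing requires the Subtask 1 prediction to be derived from the Subtask 2 predictions in this way; a separate model could predict the change directly.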

Practice Phase

Please be aware that the Practice Phase is still open to submissions and that the Practice Kit and the Trial Data it contains have been updated. This can be really helpful, since there is no limit on submissions to this phase, unlike the evaluation one. The trial data consists of three small samples of the actual English, Croatian and Slovene datasets (10, 5 and 5 pairs respectively). Unlike the evaluation data, it includes the human annotation values, which lets you see how the evaluation works and can serve as an initial estimate of how your model is doing. The kit also includes a baseline based on multilingual BERT, the script to create it, the instructions given to the annotators and an example of the surveys used.

The Practice Kit folder contents:

  • data: Tab separated files containing the input pairs and contexts.

    word1 <tab> word2 <tab> context1 <tab> context2 <tab> word1_context1 <tab> word2_context1 <tab> word1_context2 <tab> word2_context2

    The additional fields contain the 'inflected' versions of the words as they appear in each of the contexts.
    For all languages the target words are additionally marked with a <strong></strong>.

  • gold: The gold standard human scores. These annotations won't be released until the end of the evaluation period.

    sim_context1 <tab> sim_context2 <tab> change = (sim_context2 - sim_context1)

  • submission_examples: Zip files containing different submissions that you can upload to CodaLab to test the process.

  • res1: Example of submission format for subtask1

    change = (sim_context2 - sim_context1)

  • res2: Example of submission format for subtask2

    sim_context1 <tab> sim_context2
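The file layouts above could be handled along the following lines. This is a minimal sketch, not the scripts shipped in the kit: the function names are hypothetical, the sample line is made up from the annotator example earlier on this page, and whether the real files carry a header row or how numbers should be formatted is not stated here, so check the kit's own example files before relying on it.

```python
import re

# Column order as listed for the "data" files.
FIELDS = [
    "word1", "word2", "context1", "context2",
    "word1_context1", "word2_context1",
    "word1_context2", "word2_context2",
]

# Target words are marked with <strong></strong> inside each context.
STRONG = re.compile(r"<strong>(.*?)</strong>")


def parse_line(line):
    """Split one tab-separated data line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def marked_words(context):
    """Return the words wrapped in <strong></strong>, in order of appearance."""
    return STRONG.findall(context)


def format_res1(change):
    """One res1 line: the predicted change (sim_context2 - sim_context1)."""
    return f"{change}"


def format_res2(sim1, sim2):
    """One res2 line: sim_context1 <tab> sim_context2."""
    return f"{sim1}\t{sim2}"


# Made-up sample line for illustration only.
sample = "\t".join([
    "population", "people",
    "The <strong>population</strong> of India is actually bigger than most <strong>people</strong> expect.",
    "The <strong>population</strong> of bison became a lot smaller when <strong>people</strong> settled in the valley.",
    "population", "people", "population", "people",
])
row = parse_line(sample)
```

Here `marked_words(row["context1"])` recovers the two target words as they appear in the first context, which is useful when the inflected form differs from the base form in `word1` and `word2`.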

It also includes two Python scripts in case you want to try the scoring offline.

Evaluation Phase

To participate in this phase of the task you need to download the Evaluation Kit. The kit is organized in a very similar way to the practice one, with the important difference that it does not contain the human annotator gold standard. To evaluate your model's results you need to submit them to CodaLab. The results will be visible; however, you have a maximum of 9 submissions per team/participant. Please make sure you are comfortable with the file formatting and the CodaLab interface by using the practice phase first.

In the "data" folder you will find the 4 datasets, which contain 340 English pairs, 112 Croatian pairs, 111 Slovene pairs and 24 Finnish pairs. Unfortunately we weren't able to include any Estonian pairs, although some may still be added when the whole dataset is released once the task is finished. Because we couldn't remove Estonian from the leaderboard, this language will always show a result of 0.

Evaluation Metrics

  • Subtask 1 - Predicting Change of Similarity Ratings: We will use the uncentered Pearson correlation against the gold scores by human annotators.
  • Subtask 2 - Predicting Similarity Ratings in Context: We will use the harmonic mean of the Pearson and Spearman correlations against the gold scores by human annotators.
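For offline checks, the two metrics can be approximated as follows. This is a sketch under stated assumptions rather than the official scorer: it uses the standard definition of the uncentered Pearson correlation (the Pearson formula without subtracting the means), computes Spearman as Pearson on average ranks, and returns 0.0 from the harmonic mean when the two correlations sum to zero, which is a choice made here because the page does not say how that case is handled. All function names are hypothetical.

```python
import math


def uncentered_pearson(x, y):
    """Pearson correlation without mean subtraction (Subtask 1 metric)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def pearson(x, y):
    """Standard (centered) Pearson correlation; inputs must not be constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered_pearson([a - mx for a in x], [b - my for b in y])


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_mean(a, b):
    """Harmonic mean of two scores; 0.0 if they sum to zero (assumption)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def subtask2_score(pred, gold):
    """Harmonic mean of Pearson and Spearman (Subtask 2 metric)."""
    return harmonic_mean(pearson(pred, gold), spearman(pred, gold))
```

One practical consequence of the uncentered variant is that the sign of each predicted change matters directly: adding a constant offset to all predictions changes the Subtask 1 score, whereas the centered Pearson correlation would be unaffected.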

We will declare a winner for each of the languages. Unfortunately, we weren't able to create independent leaderboards per language on CodaLab, so for the purpose of the leaderboard there, we will order submissions based on the English results.

Submissions will be valid with any number of languages (scoring a 0 for the ones not included). However, an English result file is mandatory, so please use the English baseline included in the kit if you don't want to work on this language.


Dates (Updated)

Please note that the dates for all SemEval2020 tasks have been updated:

  • Test data ready: February 19, 2020 (previously December 3, 2019)
  • Evaluation start: February 19, 2020 (previously January 10, 2020)
  • Evaluation end: March 11, 2020 (previously January 31, 2020)
  • System description paper submissions due: April 17, 2020 (previously February 23, 2020)
  • SemEval workshop: September 13-14

Practice Subtask1

Start: Aug. 10, 2019, midnight

Description: Predicting the degree and direction of change in the human annotators' scores of similarity when presented with the same pair of words within two different contexts.

Practice Subtask2

Start: Aug. 10, 2019, midnight

Description: Predicting the human scores of similarity for a pair of words within different contexts.

Evaluation Subtask1

Start: Feb. 19, 2020, 3 a.m.

Evaluation Subtask2

Start: Feb. 19, 2020, 3 a.m.

Post-Evaluation Subtask1

Start: March 12, 2020, midnight

Post-Evaluation Subtask2

Start: March 12, 2020, midnight

Competition Ends


# Username Score
1 mutaz 0.573