SemEval 2020 Task 3 - Predicting the (Graded) Effect of Context in Word Similarity

Organized by csantosarmendariz



For this task we ask participants to build systems that predict the effect that context has on human perception of word similarity.

We have seen very interesting work that uses local context to predict discrete changes in meaning: the different senses of a polysemous word. However, context also has more subtle, continuous (graded) effects on meaning, even for words not necessarily considered polysemous.


To study these effects, we are building several datasets in which annotators score how similar two words are after reading a short paragraph that contains both words. Each pair is scored within two of these paragraphs, allowing us to look at changes in similarity ratings due to context.

Let's see some examples from the instructions we give to annotators.

Example 1 - Room and Cell:

The meaning of words can be affected by the sentences and contexts in which we find them. In some cases, this is because different contexts make it clear that completely different senses of a word are being intended.

In Sentence 1 below, the words room and cell both refer to different kinds of room in a building:

Sentence 1: Her prison cell was almost an improvement over her room at the last hostel.

However, in Sentence 2 below, the words are being used in different senses ('room' as an abstract concept and 'cell' as a biological term):

Sentence 2: His job as a biologist didn't leave much room for a personal life. He knew much more about human cells than about human feelings.

We would expect most people to agree that room and cell have much more similar meanings in Sentence 1 than they do in Sentence 2.

Example 2 - Population and People:

However, context can affect meaning in more subtle ways too, making us think of concepts in slightly different ways even if they have the same overall sense.
For example, in Sentence 1 below, the words 'population' and 'people' seem quite closely related, because we know that the population is made of people:

Sentence 1: The population of India is actually bigger than most people expect.

In contrast, in Sentence 2, the same words seem less similar, because this time we are talking about a population of bison - a group of animals rather than a group of people:

Sentence 2: The population of bison became a lot smaller when people settled in the valley.

Again, we would expect most people to agree that population and people have more similar meanings in Sentence 1 than in Sentence 2.

In the following tasks, we will ask you to select sentences which make the meanings of two words seem more and less similar.

Example 3 - From the pilot:

As we can see, even if these words are not particularly polysemous, there is a significant difference in the average score of similarity after reading each of the contexts. This could be related to the fact that the first context refers to "gazelle population".

We are building datasets, containing these contextual similarity ratings, in four (possibly five) different languages:

  • Croatian: HR
  • English: EN
  • Finnish: FI
  • Slovenian: SL
  • Estonian may be added to the list.

The pairs of words come from the well-known SimLex-999 dataset. The contexts will be chosen so as to encourage different perceptions of similarity. Polysemy will play a role; however, we are especially interested in more subtle, graded changes in meaning.

You can read more details here:


The development of this task and the associated dataset is supported by the European Union Horizon 2020 research and innovation programme under Grant No. 825153, EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).


There are two different subtasks:

  1. Subtask 1: Predicting the change in the human annotators' scores of similarity when presented with the same pair of words within two different contexts. This task directly addresses our main question. It evaluates how well systems are able to model the effect that context has on human perception of similarity.

  2. Subtask 2: Predicting the human scores of similarity for a pair of words within two different contexts. This is a more traditional task which evaluates systems' ability to model both similarity of words and the effect that context has on it.

Both are unsupervised tasks: we won't be releasing training data. Both use the same input data (pairs of words and contexts), but each has its own phases and leaderboards. This means the submissions are independent and you can use different models for each of the subtasks.

Practice Phase

The trial data contains small examples for English, Croatian and Slovene. When preparing the submission, please follow the naming and format shown in the Starting Kit. Then compress the results file using zip (without any extra directory) and submit the zip file through CodaLab in the "Participate" section. You will get a score and be added to the leaderboard. Feel free to make as many practice submissions as you want.
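
The packaging step can be sketched as follows. This is a minimal sketch using Python's standard zipfile module, not the official tooling. The function name zip_results and the file names in the usage note are hypothetical; the real file names must follow the Starting Kit.

```python
import os
import zipfile


def zip_results(result_paths, zip_path):
    """Write each results file at the root of a zip archive (no extra directory)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in result_paths:
            # arcname drops any directory component, so the file sits at the archive root.
            archive.write(path, arcname=os.path.basename(path))
    return zip_path
```

For instance, zip_results(["output/my_results.tsv"], "submission.zip") would store the file as my_results.tsv at the top level of the archive, whatever directory it was read from.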

The Starting Kit contains the trial data and examples of how to format your submissions:

  • data: Tab separated files containing the input pairs and contexts.

    English Dataset:

    word1 <tab> word2 <tab> context1 <tab> context2

    Other Datasets:

    word1 <tab> word2 <tab> context1 <tab> context2 <tab> word1_context1 <tab> word2_context1 <tab> word1_context2  <tab> word2_context2

    The additional fields contain the 'inflected' versions of the words as they appear in each of the contexts.
    For all languages the target words are additionally marked with <strong></strong> tags.

  • gold: The gold standard human scores. These annotations won't be released until the end of the evaluation period.

    sim_context1 <tab> sim_context2 <tab> change = (sim_context2 - sim_context1)

  • submission_examples: Zip files containing different submissions that you can upload to CodaLab to test the process.

  • submission_subtask1: Example of submission format

    change = (sim_context2 - sim_context1)

  • submission_subtask2: Example of submission format

    sim_context1 <tab> sim_context2

The Starting Kit also includes two Python scripts in case you want to try the scoring offline.
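
The file layouts listed above can be handled with a few lines of Python. The sketch below is an illustration only: all function names are hypothetical, and it does not cover whether the real files start with a header row or how numbers should be formatted, so check the Starting Kit examples for those details.

```python
import re

# Column names in the order given in the data description.
FIELDS = [
    "word1", "word2", "context1", "context2",
    "word1_context1", "word2_context1", "word1_context2", "word2_context2",
]
STRONG_RE = re.compile(r"<strong>(.*?)</strong>")


def parse_row(line):
    """Split one tab-separated line into named fields.

    English rows have four fields; the other languages have eight.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) not in (4, 8):
        raise ValueError(f"expected 4 or 8 fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def marked_words(context):
    """Return the target words wrapped in <strong></strong>, in order of appearance."""
    return STRONG_RE.findall(context)


def subtask1_line(sim_context1, sim_context2):
    """One Subtask 1 output line: change = sim_context2 - sim_context1."""
    return f"{sim_context2 - sim_context1}"


def subtask2_line(sim_context1, sim_context2):
    """One Subtask 2 output line: both similarity scores, tab separated."""
    return f"{sim_context1}\t{sim_context2}"
```

Here sim_context1 and sim_context2 stand for whatever similarity scores your own system predicts for the pair in each context.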


  • Subtask 1 - Predicting Change of Similarity Ratings: We will use Uncentered Pearson correlation against gold scores by human annotators.
  • Subtask 2 - Predicting Similarity Ratings in Context: We will use the harmonic mean of the Pearson and Spearman correlations against the gold scores by human annotators.
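
The two metrics above can be approximated offline with the standard library alone. This is a sketch under stated assumptions, not the official scoring scripts: it takes "uncentered Pearson" to mean the Pearson formula without subtracting the means (so a change of zero stays the reference point), and it computes Spearman as Pearson over average ranks. All function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Standard (mean-centered) Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def uncentered_pearson(xs, ys):
    """Pearson formula without mean subtraction (assumed definition)."""
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def harmonic_mean(a, b):
    """Harmonic mean of two scores; undefined when a + b == 0."""
    return 2 * a * b / (a + b)


def subtask2_score(predicted, gold):
    """Harmonic mean of Pearson and Spearman against the gold scores."""
    return harmonic_mean(pearson(predicted, gold), spearman(predicted, gold))
```

Note that the harmonic mean is not well behaved when either correlation is zero or negative; this sketch makes no claim about how the official scorer treats that case.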

We are interested in the highest scores per language. However, a bug in CodaLab seems to prevent the use of several leaderboards, so there will be only one leaderboard per subtask.

The position in that leaderboard will be calculated as the average of your results across languages. When a team doesn't submit results for some of the languages, the baseline scores (multilingual BERT) will be used for those languages instead.

This gives an advantage to teams that submit results for several languages, but only for the position in the CodaLab leaderboards. The goal is still to get the highest scores in each language independently.
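
The leaderboard averaging described above can be illustrated with a short sketch. The function name is hypothetical and the baseline numbers in the usage example are made up purely for illustration; they are not the real multilingual BERT baseline scores.

```python
def leaderboard_score(submitted, baseline):
    """Average per-language scores, using the baseline where nothing was submitted.

    Both arguments map a language code (e.g. "EN") to a score.
    """
    per_language = [submitted.get(lang, base) for lang, base in baseline.items()]
    return sum(per_language) / len(per_language)


# Made-up baseline values, for illustration only.
example_baseline = {"EN": 0.5, "FI": 0.3, "HR": 0.4, "SL": 0.2}

# A team that submits only English keeps the baseline for the other languages.
english_only = leaderboard_score({"EN": 0.7}, example_baseline)
```

With these made-up numbers, the English-only team's average ends up above the pure baseline average but below what the same team would reach by also beating the baseline in the other languages, which is the advantage mentioned above.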

Dates (Updated)

Please note that the dates for all SemEval 2020 tasks have been updated.

  • Test data ready: February 19, 2020 (previously December 3, 2019)
  • Evaluation start: February 19, 2020 (previously January 10, 2020)
  • Evaluation end: March 11, 2020 (previously January 31, 2020)
  • System description paper submissions due: April 17, 2020 (previously February 23, 2020)
  • SemEval workshop on September 13-14

Practice Subtask1

Start: Aug. 10, 2019, midnight

Description: Predicting the degree and direction of change in the human annotator's scores of similarity when presented with the same pair of words within two different contexts.

Practice Subtask2

Start: Aug. 10, 2019, midnight

Description: Predicting the human scores of similarity for a pair of words within different contexts.

Evaluation Subtask1

Start: Feb. 18, 2020, midnight

Evaluation Subtask2

Start: Feb. 18, 2020, midnight

Post-Evaluation Subtask1

Start: March 12, 2020, midnight

Post-Evaluation Subtask2

Start: March 12, 2020, midnight

Competition Ends


# Username Score
1 hansih 1.000
2 NahedAbdelgaber 1.000
3 will_go 1.000