SemEval 2020 Task 3 - Predicting the (Graded) Effect of Context in Word Similarity

Organized by csantosarmendariz

Overview

For this task we ask participants to build systems that predict the effect that context has on human perception of the similarity of words.

We have seen very interesting work that uses local context to predict discrete changes in meaning: the different senses of a polysemous word. However, context also has more subtle, continuous (graded) effects on meaning, even for words not necessarily considered polysemous.

Datasets

In order to look at these effects, we are building several datasets where we ask annotators to score how similar a pair of words is after they have read a short paragraph that contains the two words. Each pair is scored within two such paragraphs, allowing us to look at changes in similarity ratings due to context.

Let's see some examples from the instructions we give to annotators.

Example 1 - Room and Cell:

The meaning of words can be affected by the sentences and contexts in which we find them. In some cases, this is because different contexts make it clear that completely different senses of a word are being intended.

In Sentence 1 below, the words room and cell both refer to different kinds of room in a building:

Sentence 1: Her prison cell was almost an improvement over her room at the last hostel.

However, in Sentence 2 below, the words are being used in different senses ('room' as an abstract concept and 'cell' as a biological term):

Sentence 2: His job as a biologist didn't leave much room for a personal life. He knew much more about human cells than about human feelings.

We would expect most people to agree that room and cell have much more similar meanings in Sentence 1 than they do in Sentence 2.

Example 2 - Population and People:

However, context can affect meaning in more subtle ways too, making us think of concepts in slightly different ways even if they have the same overall sense.
For example, in Sentence 1 below, the words 'population' and 'people' seem quite closely related, because we know that the population is made of people:

Sentence 1: The population of India is actually bigger than most people expect.

In contrast, in Sentence 2, the same words seem less similar, because this time we are talking about a population of bison - a group of animals rather than a group of people:

Sentence 2: The population of bison became a lot smaller when people settled in the valley.

Again, we would expect most people to agree that population and people have more similar meanings in Sentence 1 than in Sentence 2.

In the following tasks, we will ask you to select sentences which make the meanings of two words seem more and less similar.

Example 3 - From the pilot:

In the pilot annotation, even though these words are not particularly polysemous, there was a significant difference in the average similarity score after reading each of the two contexts. This could be related to the fact that the first context refers to a "gazelle population".

We are building datasets, containing these contextual similarity ratings, in four (possibly five) different languages:

  • Croatian: HR
  • English: EN
  • Finnish: FI
  • Slovenian: SL
  • Estonian: ET (may be added to the list)

The pairs of words come from the well-known SimLex-999 dataset. The contexts will be chosen so as to encourage different perceptions of similarity. Polysemy will play a role; however, we are especially interested in more subtle, graded changes in meaning.

Submissions

There are two different subtasks:

  1. Subtask 1: Predicting the change in the human annotators' scores of similarity when presented with the same pair of words within two different contexts. This task directly addresses our main question: it evaluates how well systems are able to model the effect that context has on human perception of similarity (see the short illustration after this list).

  2. Subtask 2: Predicting the human scores of similarity for a pair of words within two different contexts. This is a more traditional task, which evaluates systems' ability to model both the similarity of words and the effect that context has on it.
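
To make the two targets concrete, here is a small illustration with purely hypothetical scores: if annotators rate a pair at 5.5 in the first context and at 8.0 in the second, a Subtask 1 system should predict the change 8.0 - 5.5 = 2.5, while a Subtask 2 system should predict the two in-context scores 5.5 and 8.0 themselves.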

Both are unsupervised tasks: we won't be releasing training data. Both use the same input data (pairs of words and contexts), but each of them has its own phases and leaderboards. This means the submissions are independent and you can use different models for each of the subtasks.

Practice Phase

For the moment the trial data only contains a small English sample. We will provide samples of the other languages in September. However, you can already test your submissions and the online and offline scoring against that small sample.

When preparing the submission, please follow the naming and format that you can see in the Starting Kit. Then compress the results file using zip (without any extra directory) and submit the zip file through CodaLab here. You will get a score and will be added to the leaderboard. Feel free to make as many practice submissions as you want.

The Starting Kit contains the trial data and examples of how to format your submissions:

  • data: Tab separated files containing the input pairs and contexts.

    word1 <tab> word2 <tab> context1 <tab> context2 <tab> simlex score

  • gold: The gold standard human scores. These annotations won't be released until the end of the evaluation period.

    sim_score_context1 <tab> sim_score_context2 <tab> difference = (sim_score_context2 - sim_score_context1)

  • submission_subtask1: Example of submission format

    difference in the similarity scores

  • submission_subtask2: Example of submission format

    sim_score_context1 <tab> sim_score_context2

It also includes two Python scripts in case you want to try the scoring offline. A minimal sketch of how a submission could be prepared is shown below.
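
The following Python sketch shows one way to read the trial data and write submission files for both subtasks, then zip a result file for upload. It is only an illustration, not the official baseline: the file names are placeholders, and it assumes a headerless tab-separated file in the format shown above; always follow the naming used in the Starting Kit.

    # Minimal sketch of producing submission files; file names are placeholders.
    import csv
    import zipfile

    def write_submissions(data_path, sub1_path, sub2_path):
        with open(data_path, newline='', encoding='utf-8') as f_in, \
             open(sub1_path, 'w', encoding='utf-8') as f_sub1, \
             open(sub2_path, 'w', encoding='utf-8') as f_sub2:
            # Assumes no header row; QUOTE_NONE keeps quotes inside contexts verbatim.
            reader = csv.reader(f_in, delimiter='\t', quoting=csv.QUOTE_NONE)
            for word1, word2, context1, context2, simlex in reader:
                # Trivial baseline: pretend context has no effect and reuse the
                # out-of-context SimLex score for both contexts.
                score1 = score2 = float(simlex)
                f_sub1.write(f"{score2 - score1}\n")      # subtask 1: change in similarity
                f_sub2.write(f"{score1}\t{score2}\n")     # subtask 2: both in-context scores

    write_submissions("data_en.tsv", "res_subtask1_en.tsv", "res_subtask2_en.tsv")

    # Zip the results file without any extra directory, as required for upload.
    with zipfile.ZipFile("submission_subtask1.zip", "w") as zf:
        zf.write("res_subtask1_en.tsv", arcname="res_subtask1_en.tsv")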

Evaluation

  • Subtask 1 - Predicting Change of Similarity Ratings: We will use uncentered Pearson correlation against the gold scores from human annotators.
  • Subtask 2 - Predicting Similarity Ratings in Context: We will use Spearman correlation against the gold scores from human annotators (both measures are sketched below).
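
The official scoring is implemented by the scripts in the Starting Kit; the following is only a rough sketch of the two measures, assuming predictions and gold scores are given as aligned lists of numbers.

    # Rough sketch of the two evaluation measures; the Starting Kit scripts are authoritative.
    import numpy as np
    from scipy.stats import spearmanr

    def uncentered_pearson(pred, gold):
        """Pearson correlation without mean-centering, i.e. the cosine of the two raw vectors."""
        pred = np.asarray(pred, dtype=float)
        gold = np.asarray(gold, dtype=float)
        return float(np.dot(pred, gold) / (np.linalg.norm(pred) * np.linalg.norm(gold)))

    def spearman(pred, gold):
        """Standard Spearman rank correlation."""
        return float(spearmanr(pred, gold).correlation)

    # Toy numbers, purely for illustration:
    print(uncentered_pearson([0.5, -1.0, 2.0], [0.4, -0.8, 1.5]))   # subtask 1 style (changes)
    print(spearman([3.1, 7.5, 5.0], [2.9, 8.0, 5.5]))               # subtask 2 style (scores)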

When more languages are available, the ranking will be individual per language. However, it seems a bug in CodaLab doesn't allow the use of several leaderboards, so there will be only one leaderboard per subtask.

The position in that leaderboard will be calculated as the average of your results across the languages. When a team doesn't submit results for some of the languages, the baseline scores will be used instead (a small sketch of this averaging follows below).

This gives an advantage to teams that submit results for several languages, but only for the position in the CodaLab leaderboards. The goal is still to get the highest scores in each language independently.
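
As a purely hypothetical illustration of that averaging (the baseline values below are invented; the real per-language baselines will be released with the baseline models):

    # Hypothetical leaderboard averaging; baseline values are made up for illustration.
    BASELINES = {"en": 0.3, "fi": 0.3, "hr": 0.3, "sl": 0.3}

    def leaderboard_average(team_scores):
        """team_scores maps a language code to the team's score for that language."""
        filled = [team_scores.get(lang, BASELINES[lang]) for lang in BASELINES]
        return sum(filled) / len(filled)

    # A team submitting only English and Finnish falls back to the baselines for hr and sl.
    print(leaderboard_average({"en": 0.7, "fi": 0.5}))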

Dates

The trial data for English is released and we are now in the "Practice" phase.

Samples for other languages will be released in September, together with some baseline models. Once these are available, we will update the instructions here and the Starting Kit to describe how to submit results for multiple languages.

The official "Evaluation" phase starts on January 10, 2020 and will be open until January 31, 2020.

Practice Subtask1

Start: Aug. 10, 2019, midnight

Description: Predicting the degree and direction of change in the human annotators' scores of similarity when presented with the same pair of words within two different contexts.

Practice Subtask2

Start: Aug. 10, 2019, midnight

Description: Predicting the human scores of similarity for a pair of words within different contexts.

Evaluation Subtask1

Start: Jan. 10, 2020, midnight

Evaluation Subtask2

Start: Jan. 10, 2020, midnight

Post-Evaluation Subtask1

Start: Feb. 1, 2020, midnight

Post-Evaluation Subtask2

Start: Feb. 1, 2020, midnight

Competition Ends

Never
