SMM4H'19 - Shared Task

Organized by dweissen


Task Definition

The SMM4H shared tasks involve NLP challenges in social media mining for health monitoring and surveillance. This requires processing imbalanced, noisy, real-world, and often highly creative language from social media. Systems should be able to handle the many linguistic variations and semantic complexities in the various ways people express medication-related concepts and outcomes. Past research has shown that automated systems frequently underperform on social media text because of novel or creative phrasing, misspellings, and the frequent use of idiomatic, ambiguous, and sarcastic expressions. The tasks will thus act as a discovery and verification process for which approaches work best on social media data.

Similar to the first three runs of the shared tasks, the data include annotated collections of posts on Twitter. The training data is already prepared and will be available to the teams registering to participate.

The four shared tasks proposed this year are:

  • Task 1: Automatic classification of adverse effect mentions in tweets
  • Task 2: Extraction of adverse effect mentions
  • Task 3: Normalization of adverse drug reaction mentions
  • Task 4: Generalizable identification of personal health experience mentions

Timeline (Tentative)

 Jan 23, 2019  Trial Data Release, Practice Phase starts 
 Feb 22, 2019  Training Data Release 
 April 15, 2019  Test Data Release, Evaluation Phase starts  
 April 19, 2019  Evaluation Phase ends, Post-Evaluation Phase starts 

Organizers

  • Graciela Gonzalez-Hernandez, Ph.D., The Perelman School of Medicine, University of Pennsylvania [web]
  • Davy Weissenbacher, Ph.D., The Perelman School of Medicine, University of Pennsylvania [web|mail: dweissen@pennmedicine.upenn.edu]
  • Michael Paul, Ph.D., Department of Information Science, University of Colorado-Boulder [web]
  • Abeed Sarker, Ph.D., The Perelman School of Medicine, University of Pennsylvania [web]
  • Ashlynn R. Daughton, MPH, Department of Information Science, University of Colorado Boulder
  • Arjun Magge, MS, College of Health Solutions, Arizona State University
  • Ari Z. Klein, Ph.D., The Perelman School of Medicine, University of Pennsylvania
  • Karen O'Connor, MS, The Perelman School of Medicine, University of Pennsylvania

Evaluation Metrics

With TP, FP, TN, and FN standing for True Positive, False Positive, True Negative and False Negative, respectively, the list below details all metrics used for each task:

  • Task 1: Precision=TP/(TP+FP) ; Recall=TP/(TP+FN) and the balanced F1-score=2*((Precision * Recall)/(Precision + Recall)), with the Adverse Effect being the positive class
  • Task 2: Relaxed and Strict Precision, Recall and the balanced F1-score.
  • Task 3: Relaxed and Strict Precision, Recall and the balanced F1-score.
  • Task 4: Accuracy=(TP+TN)/(TP+TN+FP+FN), Precision, Recall and the balanced F1-score
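As a reference, the metric definitions above can be sketched in a few lines of Python. This is only an illustration of the formulas, not the official evaluation script:

```python
# Minimal sketch of the evaluation metrics defined above.
# TP/FP/TN/FN are counts for the positive class (e.g., ADR for Task 1).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    # Balanced F1: harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * (p * r) / (p + r)

def accuracy(tp, tn, fp, fn):
    # Used for Task 4 in addition to P/R/F1.
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, a system with 8 true positives, 2 false positives, and 2 false negatives scores 0.8 on precision, recall, and F1.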

Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the SMM4H'19 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers. You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science. You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers. You further agree to submit and present a short paper describing your system during the workshop. You agree not to redistribute the training and test data except in the manner prescribed by its licence.

The official results of the competition have been published in: Overview of the Fourth Social Media Mining for Health (SMM4H) Shared Tasks at ACL 2019. We have left the competition open; feel free to try to beat the best systems!

Task Details

Task 1: Automatic classification of adverse effect mentions in tweets

The system designed for this sub-task should distinguish tweets that report an adverse effect (AE) from those that do not, taking into account the subtle linguistic variations between adverse effects and indications (the reason to use the medication). This is a rerun of the popular classification task organized in 2016, 2017, and 2018.

Data

  • Training data: 25,672 tweets (2,374 positive and 23,298 negative)
  • Evaluation data: approximately 5,000 tweets.
  • Evaluation metric: F-score for the ADR/positive class.

For each tweet, the publicly available data set contains: (i) the user ID, (ii) the tweet ID, and (iii) the binary annotation indicating the presence or absence of ADRs, as shown below. The evaluation data will contain the same information, but without the classes. Participating teams should submit their results in the same format as the training set (shown below).

Tweet ID            User ID      Class
354256195432882177  54516759     0
352456944537178112  1267743056   1
332479707004170241  273421529    0
340660708364677120  135964180    1
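The sample above is whitespace-separated; a minimal reader for this layout might look like the sketch below. The delimiter and any file name are assumptions here, so adjust them to match the released files (e.g., if they are tab-separated):

```python
# Hypothetical reader for the Task 1 training format shown above.
# Assumes one whitespace-separated record per line: tweet ID, user ID, class.

def parse_task1(lines):
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet_id, user_id, label = line.split()
        rows.append((tweet_id, user_id, int(label)))
    return rows
```

Submissions should mirror the same three-column layout, minus the class column in the unlabeled evaluation release.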

Task 2: Extraction of Adverse Effect mentions

As a follow-up to Task 1, this task involves identifying the text spans of reported ADRs and distinguishing ADRs from similar non-ADR expressions. ADRs are multi-token, descriptive expressions, so this sub-task requires advanced named entity recognition (NER) approaches. The data for this sub-task include 2,000+ tweets fully annotated for mentions of ADRs and indications. This set contains a subset of the tweets from Task 1 tagged as hasADR, plus an equal number of noADR tweets. Some tweets in the noADR subset were annotated for mentions of indications to allow participants to develop techniques for dealing with this confusion class.

Data

  • Training data: 2,367 (1,212 positive and 1,155 negative)
  • Evaluation data: 1,000 (~500 positive, ~500 negative)
  • Evaluation metric: Strict and Relaxed F1-score, Precision and Recall

For each tweet, the publicly available data set contains: (i) the tweet ID, (ii) the start and (iii) the end of the span, (iv) the annotation indicating an ADR or not, and (v) the text covered by the span in the tweet. The evaluation data will contain only the tweet IDs.

Tweet ID           Begin  End  Class  text
346575368041410561 106    120  ADR    gaining weight 
345922687161491456 27     34   ADR    sweatin 
343961812334686208 -      -    noADR  - 
345167138966876161 -      -    noADR  - 
342380499802681344 118    139  ADR    difficult to come off
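"Strict" and "relaxed" matching are not formally defined on this page. A common interpretation, assumed here and not necessarily identical to the official scorer, is exact-offset matching versus any character overlap:

```python
# Sketch of strict vs. relaxed span matching. Spans are (begin, end)
# character offsets with an exclusive end, consistent with the table
# above ("gaining weight" spans 106-120, i.e., 14 characters).
# The official scorer may differ; this is only an illustration.

def strict_match(gold_span, pred_span):
    # Strict: begin and end offsets must match exactly.
    return gold_span == pred_span

def relaxed_match(gold_span, pred_span):
    # Relaxed: the two spans share at least one character.
    g0, g1 = gold_span
    p0, p1 = pred_span
    return max(g0, p0) < min(g1, p1)
```

Under these definitions a predicted span of (110, 125) would count as a relaxed hit for the gold span (106, 120) but not a strict one.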

Task 3: Normalization of adverse drug reaction mentions (ADR)

Task 3 is an end-to-end task: the objective is to detect tweets mentioning an ADR and to map the extracted colloquial ADR mentions to standard concept IDs in the MedDRA vocabulary (lower level terms). This requires understanding the semantic interpretation of ADRs in order to map them to standard concept IDs, and it is likely to require a semi-supervised approach to successfully disambiguate ADRs.

Data

  • Training data: 2,367 (1,212 positive and 1,155 negative)
  • Evaluation data: 1,000 (~500 positive, ~500 negative)
  • Evaluation metric: Strict and Relaxed F1-score, Precision and Recall

For each ADR mention, the publicly available data set contains: (i) the tweet ID, (ii) the start and (iii) the end of the span, (iv) the annotation indicating an ADR or not, (v) the text covered by the span in the tweet, and (vi) the corresponding ID of the preferred term in the MedDRA vocabulary. The evaluation data will contain only the tweet IDs.

Tweet ID           Begin  End  Class  Text                  MEDDRA ID
346575368041410561 106    120  ADR    gaining weight        10047899 
345922687161491456 27     34   ADR    sweatin               10020642 
343961812334686208 -      -    noADR  -                     - 
345167138966876161 -      -    noADR  -                     - 
342380499802681344 118    139  ADR    difficult to come off 10048010
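A naive normalization baseline, purely illustrative and far from the semi-supervised approaches the task calls for, is an exact-match lexicon built from the training annotations:

```python
# Hypothetical exact-match baseline for Task 3: map a lowercased ADR
# mention to the MedDRA ID it carried in the training data.

def build_lexicon(pairs):
    # pairs: iterable of (mention_text, meddra_id) from the training file
    return {text.lower(): meddra_id for text, meddra_id in pairs}

def normalize(lexicon, mention):
    # Returns None for mentions never seen in training; a real system
    # would need to generalize to unseen colloquial expressions.
    return lexicon.get(mention.lower())
```

Such a lookup fails on any paraphrase or misspelling absent from the training set, which is exactly where the semantic interpretation of ADRs becomes necessary.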

Task 4: Generalizable identification of personal health experience mentions

This binary classification task is to determine whether a tweet contains a personal mention of one's health (for example, sharing one's own health status or opinion), as opposed to a more general discussion of the health issue or an unrelated use of the word. The shared task involves multiple tweet datasets annotated for personal health mentions across different health issues. The training data include one disease domain (influenza) across two contexts (being sick and getting vaccinated), both annotated for personal mentions (the user is personally sick, or the user has been personally vaccinated). The test data will include an additional disease domain beyond influenza, in a context other than being sick or being vaccinated.

Data

  • Training data: 10,876 tweets
  • Evaluation data: TBA
  • Evaluation metric: Accuracy, F1-score, Precision and Recall

Each dataset includes (i) the tweet ID, and (ii) the binary annotation. The evaluation data will contain the same information, but without the class labels.

Tweet ID   Label
6004314210 0 
6003319713 0 
5991525204 0 
5989718714 0 
5986621813 0

FAQ

Q: How will I submit my results?
A: The submission format for each task is described in the task details above. Submissions are made through CodaLab.

Q: How many submissions can I make?
A: For each task, three submissions from each team will be accepted. You can participate in one or multiple tasks.

Q: Can I participate in Task 2 only?
A: Yes. You can participate in any number of tasks.

Q: Are there any restrictions on the data and resources that can be used for training the classification system? For example, can we use manually or automatically constructed lexicons? Can we use other data (e.g., tweets, blog posts, medical records), annotated or unlabeled?
A: There are currently no restrictions: external resources and data can be used. All external resources must be described in the system description paper.

Q: Is there any information on the test data? Will the test data be collected in the same way as the training data? For example, will the same drug names be used to collect tweets?
A: The test data has been collected in the same way.

Phases

Each sub-task (ADR classification, ADR extraction, ADR normalization, and health concerns) runs through the same three phases:

  • Practice — starts Jan. 1, 2019, midnight
  • Evaluation — starts April 15, 2019, midnight
  • Post-Evaluation — starts April 19, 2019, midnight

Competition Ends

Never
