SemEval 2021 - Source-Free Domain Adaptation for Semantic Processing

Organized by Egoitz


Practice start: June 9, 2020, midnight UTC
Evaluation start: Jan. 10, 2021, midnight UTC
Competition ends: Jan. 31, 2021, midnight UTC

This is the CodaLab Competition for SemEval-2021 Task 10: Source-Free Domain Adaptation for Semantic Processing.

Please join our Google Group to ask questions and get the most up-to-date information on the task.

Important Dates:

20 Aug 2020:   Pre-trained models release
3 Dec 2020:   Test data release
10 Jan 2021:   Evaluation start
31 Jan 2021:   Evaluation end


Data sharing restrictions are common in NLP datasets. For example, Twitter policies do not allow sharing of tweet text, though tweet IDs may be shared. The situation is even more common in clinical NLP, where patient health information must be protected, and annotations over health text, when released at all, often require the signing of complex data use agreements. The SemEval-2021 Task 10 framework asks participants to develop semantic annotation systems in the face of data sharing constraints. A participant's goal is to develop an accurate system for a target domain when annotations exist for a related domain but cannot be distributed. Instead of annotated training data, participants are given a model trained on the annotations. Then, given unlabeled target domain data, they are asked to make predictions.


We propose two different semantic tasks to which this framework will be applied: negation detection and time expression recognition.

  • Negation detection asks participants to classify clinical event mentions (e.g., diseases, symptoms, procedures, etc.) for whether they are being negated by their context. For example, the sentence "Has no diarrhea and no new lumps or masses" has three relevant events (diarrhea, lumps, masses) and two cue words (both no), and all three events are negated. This task is important in the clinical domain because it is common for physicians to document negated information encountered during the clinical course, for example, when ruling out certain elements of a differential diagnosis. We expect most participants will treat this as a "span-in-context" classification problem, where the model jointly considers both the event to be classified and its surrounding context. For example, a typical transformer-based encoding of this problem for the diarrhea event in the example above looks like: Has no <e>diarrhea</e> and no new lumps or masses.
  • Time expression recognition asks participants to find time expressions in text. This is a sequence-tagging task that will use the fine-grained time expression annotations that were a component of SemEval 2018 Task 6 (Laparra et al. 2018). We expect most participants will treat this as a sequence-tagging problem, as in other named-entity tagging tasks.
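The "span-in-context" encoding described for negation detection can be produced with a few lines of Python. The helper below is our own sketch, not part of the task code:

```python
def mark_event(sentence, start, end):
    """Wrap the event span in <e>...</e> tags so the model jointly
    sees the event to classify and its surrounding context."""
    return sentence[:start] + "<e>" + sentence[start:end] + "</e>" + sentence[end:]

sentence = "Has no diarrhea and no new lumps or masses"
# "diarrhea" occupies characters 7-15 of the sentence
print(mark_event(sentence, 7, 15))
# Has no <e>diarrhea</e> and no new lumps or masses
```

The marked sentence can then be fed to a transformer classifier as a single input sequence.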


Egoitz Laparra, Yiyun Zhao, Steven Bethard (University of Arizona)

Tim Miller (Boston Children's Hospital and Harvard Medical School)

Özlem Uzuner (George Mason University)


Laparra E., Xu D., Elsayed A., Bethard S., and Palmer M. SemEval 2018 Task 6: Parsing time normalizations. In Proceedings of the 12th International Workshop on Semantic Evaluation, New Orleans, Louisiana. 2018.


Negation detection will be evaluated using the standard precision, recall, and F1 scores used in most published work: recall points are gained by correctly predicting that a negated entity is negated, and precision points are gained when a predicted negation is correct.

Time expression recognition will be evaluated using the standard precision, recall and F1 previously used for the entity-finding portion of SemEval 2018 Task 6.
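For a rough self-check before submitting, the negation metric can be sketched as follows. This is an illustrative reimplementation, not the official scorer; the labels follow the 1 / -1 convention used in the system output format:

```python
def precision_recall_f1(gold, pred):
    """Compute precision, recall, and F1 for the Negated class,
    where 1 = Negated and -1 = Not negated."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g != 1 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p != 1 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 1, -1, -1], [1, -1, 1, -1]))  # (0.5, 0.5, 0.5)
```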

Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the SemEval-2021 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You agree not to redistribute the test data except in the manner prescribed by its licence.


Since the scenario proposed by SemEval-2021 Task 10 is domain adaptation with no access to the source data, no annotated training set is distributed. Instead, participants are provided with models trained on that source data, the development data representing a new domain on which participants can explore domain adaptation algorithms, and the test data representing another new domain on which the participant's approaches will be evaluated.

For negation detection, the development data is the i2b2 2010 Challenge dataset, a de-identified dataset of notes from Partners HealthCare, containing 2886 unlabeled train instances (entities in sentence context) and 5545 dev instances with corresponding negation-status labels. The original i2b2 dataset had multi-label annotations in the set Asserted, Negated, Uncertain, Hypothetical, Conditional, FamilyRelated; to align with other challenge datasets, we have kept the Negated category but mapped all other categories to "Not negated." The i2b2 2010 Challenge data requires a Data Use Agreement with Partners HealthCare, so to access the development data, participants must first obtain access through the n2c2/DBMI Data Portal. After downloading the 2010 data, participants can then run scripts in the GitHub repo for this task.

For time expression recognition, the development data is the annotated news portion of the SemEval 2018 Task 6 data. The source text is from the freely available TimeBank, and the 2,000+ time entity annotations are stored in Anafora XML format.

Participants should also obtain access to the MIMIC III corpus v1.4, as a portion of it may be used for one or both of the test sets. Access to the MIMIC data requires participants to complete a CITI "Data or Specimens Only Research" online course, and then make an online request through PhysioNet. The course takes only a couple of hours online, and access requests are typically approved within a few days.


Participants are provided with trained models for both negation detection and time expression recognition. In both cases, we have used the RoBERTa-base (Liu et al., 2019) pretrained model included in the Hugging Face Transformers library:

  • For negation detection, we provide a "span-in-context" classification model, fine-tuned on the 10,259 instances (902 negated) in the SHARP Seed dataset of de-identified clinical notes from Mayo Clinic, to which the organizers have access but which cannot currently be distributed (the models themselves are approved for distribution). In the SHARP data, clinical events are marked with a boolean polarity indicator, with values of either asserted or negated.
  • For time expression recognition, we provide a sequence-tagging model, fine-tuned on the 25,000+ time expressions in the de-identified clinical notes from the Mayo Clinic used in SemEval 2018 Task 6, which are available to the task organizers but are difficult to gain access to because of the complex data use agreements required (the models themselves are approved for distribution).


Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettlemoyer L., and Stoyanov V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint. 2019.


Getting Started: Negation

Get the unlabeled development data

The practice data (development data) is a subset of the i2b2 2010 Challenge on concepts, assertions, and relations in clinical text. If you do not already have access to this data, you will need to request access at the DBMI Data Portal. If you have obtained access, follow the portal link above and download the data by expanding the "2010 Relations Challenge Downloads" tab, and downloading the three files with the following titles:

  • Training Data: Concept assertion relation training data
  • Test Data: Reference standard for test data
  • Test Data: Test data

At the time of writing, these are the last 3 links for the 2010 data. This should give you the following files, which you should save to a single directory:

  • concept_assertion_relation_training_data.tar.gz
  • reference_standard_for_test_data.tar.gz
  • test_data.tar.gz

Extract each of these with:

  • tar xzvf concept_assertion_relation_training_data.tar.gz
  • tar xzvf reference_standard_for_test_data.tar.gz
  • tar xzvf test_data.tar.gz

Next we will extract an unlabeled training set, an unlabeled evaluation set, and a label file for the evaluation set (to test submissions and see the format). If you don't already have the task repo checked out, do so and enter the project directory:

$ git clone && cd source-free-domain-adaptation

Then to extract the training files, run the i2b2 extraction script with:

$ mkdir -p practice_text/negation && python3 <directory with three extracted i2b2 2010 folders> practice_text/negation

This will extract the three files into practice_text/negation:

  • train.tsv -- the unlabeled training data
  • dev.tsv -- the unlabeled development data
  • dev_labels.txt -- the labels for dev data

The idea during the practice phase is to use train.tsv as representative target-domain data to improve your system, and then to evaluate any improvements on dev.tsv.

Get the pretrained model and make predictions

To use the trained model to make predictions, install the requirements and run the script to process the practice data as follows:

$ pip3 install -r baselines/negation/requirements.txt
$ python3 baselines/negation/ -f practice_text/negation/dev.tsv -o submission/negation/

This script will write a file called submission/negation/system.tsv with one label per line.

Getting Started: Time

Get and prepare the practice data

The trial data for the practice phase consists of 99 articles from the AQUAINT, TimeBank and te3-platinum subsets of TempEval-2013, i.e., the newswire domain.

You can automatically download and prepare the input data for this phase by running the script available in the task repository. If you don't already have the task repo checked out and the requirements installed, you need to do so first:

$ git clone && cd source-free-domain-adaptation

$ pip3 install -r baselines/time/requirements.txt

$ python3 practice_text/

This will create a practice_text/time directory containing the plain text of the documents used in this task.

Get the model and make predictions on the practice data

The baseline for the time expression recognition is based on the pytorch implementation of RoBERTa by Hugging Face. We have used the RobertaForTokenClassification architecture from Hugging Face/transformers library to fine-tune roberta-base on 25,000+ time expressions in de-identified clinical notes. The resulting model is a sequence tagger that we have made available in Hugging Face model hub: clulab/roberta-timex-semeval. The following table shows the in-domain and out-of-domain (practice_data) performances:

                 precision   recall   F1
in-domain data     0.967     0.968   0.968
practice data      0.775     0.768   0.771

The task repository contains scripts to load and run the model (the time baseline). These scripts are based on the Hugging Face Transformers library, which makes it easy to incorporate the model into your own code; see, for example, the baseline code that loads the model and its tokenizer.

The first time you run such code, the model will be automatically downloaded to your computer. The scripts also include the basic functionality to read the input data and produce the output Anafora annotations. You can use the script to parse raw text and obtain time expressions. For example, to process the practice data, run:

$ python3 baselines/time/ -p practice_text/time/ -o submission/time/

This will create one directory per document in submission/time containing one .xml file with predictions in Anafora format.
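For orientation, an Anafora-style annotation file can be built with the standard library as in the sketch below. The id scheme and field set here are illustrative assumptions, not the authoritative schema; the baseline scripts handle the real serialization for you:

```python
import xml.etree.ElementTree as ET

def make_anafora_xml(doc_name, spans):
    """Build a minimal Anafora-style XML document with one <entity>
    element per predicted time expression (illustrative fields only)."""
    data = ET.Element("data")
    annotations = ET.SubElement(data, "annotations")
    for i, (start, end, timex_type) in enumerate(spans, 1):
        entity = ET.SubElement(annotations, "entity")
        ET.SubElement(entity, "id").text = f"{i}@{doc_name}@system"
        ET.SubElement(entity, "span").text = f"{start},{end}"  # character offsets
        ET.SubElement(entity, "type").text = timex_type
    return ET.tostring(data, encoding="unicode")

print(make_anafora_xml("APW19980807.0261", [(0, 4, "Year")]))
```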

Extend the baseline model

There are many ways to try to improve the performance of this baseline on the practice text (and later, on the evaluation text). Should you need to continue training the clulab/roberta-timex-semeval model on annotated data that you have somehow produced, you can run the script:

$ python3 baselines/time/ -t /path/to/train-data -s /path/to/save-model

The train-data directory must follow a structure similar to the practice_text/time folder and include, for each document, the raw text file (with no extension) and an Anafora annotation file (with a .xml extension). After training, the save-model directory will contain three files (pytorch_model.bin, training_args.bin and config.json) with the configuration and weights of the final model, plus the vocabulary and configuration files used by the tokenizer (vocab.json, merges.txt, special_tokens_map.json and tokenizer_config.json).

Uploading predictions to CodaLab

To upload your predictions to CodaLab, first make sure that your predictions are formatted correctly, then create a zip archive and upload it to CodaLab.

Formatting system output

For negation detection, the output format is one classifier output per line, where the lines correspond to the lines in the input. A prediction of "Negated" should be output as 1, while a prediction of "Not negated" should be output as -1.
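For instance, string predictions can be mapped to this format and written out as below (a sketch; the helper name and path are our own choices):

```python
def write_negation_predictions(labels, path):
    """Write one prediction per line: 1 for Negated, -1 for Not negated,
    in the same order as the input file's lines."""
    with open(path, "w") as f:
        for label in labels:
            f.write(("1" if label == "Negated" else "-1") + "\n")

write_negation_predictions(["Negated", "Not negated", "Negated"], "system.tsv")
```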

For time expression recognition, your system must produce Anafora XML format files in Anafora's standard directory organization.

Make sure that you comply with the following rules when you create your output directory:

  • The root must contain only the track directories, negation and time. If you are not participating in one of the tracks, do not include its directory.
  • In the negation directory, include a single tsv file with the name system.tsv.
  • In the time directory, follow the same structure and names as in the dataset:
    • Each top-level directory must contain only document directories, named exactly as in the input dataset.
    • Each document directory must contain only the corresponding annotation file.
    • The name of each annotation file must match the document name and have a .TimeNorm.system.completed.xml extension.

For example, for the development data, your directory structure should look like:

  • negation
    • system.tsv
  • time
    • AQUAINT
      • APW19980807.0261
        • APW19980807.0261.TimeNorm.system.completed.xml
      • APW19980808.0022
        • APW19980808.0022.TimeNorm.system.completed.xml
      • ...
    • TimeBank
      • ABC19980108.1830.0711
        • ABC19980108.1830.0711.TimeNorm.system.completed.xml
      • ABC19980114.1830.0611
        • ABC19980114.1830.0611.TimeNorm.system.completed.xml
      • ...

Generating and uploading the archive

The easiest way to generate the archive is to use the Makefile provided in the sample code repository. First, place your prediction files, including the entire directory structure described above, under a submission directory in the root of the sample code checkout. Then run make. This will zip up all your prediction files and produce a single zip file.

To upload your submission, go to the CodaLab competition page. Find the "Participate" tab, then the "Submit/View Results" navigation element, make sure the "Practice" button is highlighted, and click the "Submit" button. Find your zip file with the file chooser and upload it. The scoring will run in the background; usually you can refresh the page in about a minute to see the result in the table below.


Troubleshooting

You may see the error:

Traceback (most recent call last):
  File "/worker/", line 330, in run
    if input_rel_path not in bundles:
TypeError: argument of type 'NoneType' is not iterable

This is a known issue with CodaLab. The solution for now is to make a new submission with the same zip file.




Leaderboard:

#  Username  Score
1  wyclover  0.886
2  xinsu     0.870
3  Egoitz    0.834