MEDDOCAN - Medical Document Anonymization task

Organized by PlanTL-SANIDAD
Reward $3,800

First phase: April 4, 2019, midnight UTC

Competition ends: May 18, 2019, noon UTC

MEDDOCAN: Medical Document Anonymization Task

IberLEF 2019 Workshop @ SEPLN 2019 (Bilbao), 24th Sep 2019

SEAD – Plan TL sponsors the MEDDOCAN task awards for track winners.

There is a prize for each of the two sub-tracks: 1,000€ to each sub-track winner, 500€ to the second-place teams and 200€ to the third-place teams.

About the task

Clinical records with protected health information (PHI) cannot be directly shared “as is”, due to privacy constraints, making it particularly cumbersome to carry out NLP research in the medical domain. A necessary precondition for accessing clinical records outside of hospitals is their de-identification, i.e., the exhaustive removal, or replacement, of all mentioned PHI phrases.

The practical relevance of anonymization or de-identification of clinical texts motivated the proposal of two shared tasks, the 2006 and 2014 de-identification tracks, organized under the umbrella of the i2b2 community evaluation effort. The i2b2 effort has deeply influenced the clinical NLP community worldwide, but it focused on documents in English and on the characteristics of US healthcare data providers.

As part of the IberLEF 2019 initiative we organize the first community challenge task specifically devoted to the anonymization of medical documents in Spanish, called the MEDDOCAN (Medical Document Anonymization) task.

In order to carry out these tasks we have prepared a synthetic corpus of 1,000 clinical case studies. This corpus was selected manually by a practicing physician and augmented with PHI information from discharge summaries and medical genetics clinical records.

The MEDDOCAN task will be structured into two sub-tasks:

1) NER offset and entity type classification.

2) Sensitive token detection.


For this task, we have prepared a synthetic corpus of clinical cases enriched with PHI expressions, named the MEDDOCAN corpus. This MEDDOCAN corpus of 1,000 clinical case studies was selected manually by a practicing physician and augmented with PHI phrases by health documentalists, adding PHI information from discharge summaries and medical genetics clinical records. See an example of MEDDOCAN annotation visualized using the BRAT annotation interface in Figure 1.

Figure 1: An example of MEDDOCAN annotation visualized using the BRAT annotation interface.

For more detailed information see Description of the Corpus.


The MEDDOCAN corpus has been randomly sampled into three subsets: the train, the development, and the test set. The training set contains 500 clinical cases, and the development and test sets contain 250 clinical cases each.

Sample set

The sample set is composed of 15 clinical cases extracted from the training set. This sample set is also included in the evaluation script (see Resources). Download the sample set from here.

Train set

The train set is composed of 500 clinical cases. It is distributed in Brat and XML formats (the latter is based on the i2b2 XML format). Download the train set from here.

Development set

The Development set is composed of 250 clinical cases. It is distributed in Brat and XML formats (the latter is based on the i2b2 XML format). Download the development set from here.

Test set (including background set)

The Test set with the background set is composed of 3,751 clinical cases. It is distributed in plain text format. Download the test set (including the background set) from here.

Test set with gold Standard annotations

The test set with gold standard annotations is composed of 250 clinical cases. It will be available for download according to the established dates (see Schedule).



Evaluation

For the MEDDOCAN track, we will essentially follow an evaluation setting similar to that used for the previous de-identification tracks at i2b2. We will set up an external scientific advisory board with international experts on this topic, to provide feedback and experience from de-identification efforts carried out in the US and UK.

We are also aware that there is considerable implicit variability between document types and between hospitals that, in practice, affects the difficulty of this kind of track; but since this is the first time such a task is being carried out for Spanish, we do not want to add another level of complexity. From previous de-identification efforts it became clear that different uses might require a different balance between precision and recall. For instance, for internal use within hospital settings (limited data release), high precision is more desirable, as there is a reduced risk of exposure of these documents, whereas for a HIPAA-compliant release in the US (unlimited data release), high recall is critical to avoid sensitive data leakage.

Evaluation of automatic predictions for this task will have two different scenarios or sub-tracks: the NER offset and entity type classification sub-track and the sensitive span detection sub-track.

NER offset and entity type classification: The first evaluation scenario will consist of the classical entity-based or instance-based evaluation, which requires that system outputs match exactly the beginning and end locations of each PHI entity tag, as well as correctly detecting the annotation type.
Sensitive span detection: The second evaluation scenario or sub-track is more specific to the practical scenario needed for releasing de-identified clinical documents, where the ultimate goal is to identify and be able to obfuscate or mask sensitive data, regardless of the actual entity type or the correct offset identification of multi-token sensitive phrase mentions. This second sub-track will use a span-based evaluation, which only checks whether spans belonging to sensitive phrases are detected correctly. This boils down to a classification of spans, where systems try to obfuscate spans that contain sensitive PHI expressions.
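The difference between the two matching criteria can be sketched as follows. This is an illustrative comparison, not the official evaluation script, and the entity type labels are only examples:

```python
# Entities are represented as (start, end, type) character-offset tuples.

def strict_matches(gold, pred):
    """Sub-track 1: exact offsets AND the entity type must match."""
    return set(gold) & set(pred)

def span_matches(gold, pred):
    """Sub-track 2: only the (start, end) span must match; type is ignored."""
    gold_spans = {(s, e) for s, e, _ in gold}
    pred_spans = {(s, e) for s, e, _ in pred}
    return gold_spans & pred_spans

gold = [(0, 5, "NOMBRE_SUJETO_ASISTENCIA"), (10, 20, "FECHAS")]
pred = [(0, 5, "NOMBRE_PERSONAL_SANITARIO"), (10, 20, "FECHAS")]

print(len(strict_matches(gold, pred)))  # 1: only the date matches in type too
print(len(span_matches(gold, pred)))    # 2: both spans are detected
```

The first prediction is penalized under sub-track 1 (wrong type) but counts as correct under sub-track 2, since the sensitive span was still found and could be masked.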

As part of the evaluation process, we plan to carry out statistical significance testing between system runs using approximate randomization, following settings previously used in the context of the i2b2 challenges. The evaluation scripts, together with proper documentation and README files with instructions, will be freely available on GitHub, to enable local testing of the evaluation scripts by participating teams.

For both sub-tracks the primary de-identification metrics used will consist of standard measures from the NLP community, namely micro-averaged precision, recall, and balanced F-score:

Precision (P) = true positives/(true positives + false positives)

Recall (R) = true positives/(true positives + false negatives)

F-score (F1) = 2*((P*R)/(P+R))
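The three formulas above can be sketched in a few lines. This is a minimal illustration of micro-averaging (pooling the counts over all documents before computing the ratios), not the official evaluation script:

```python
def micro_prf(per_doc_counts):
    """per_doc_counts: list of (tp, fp, fn) tuples, one per document.
    Micro-averaging sums the raw counts over the whole corpus first,
    then applies the precision/recall/F1 formulas once."""
    tp = sum(c[0] for c in per_doc_counts)
    fp = sum(c[1] for c in per_doc_counts)
    fn = sum(c[2] for c in per_doc_counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Two documents: (8 TP, 2 FP, 0 FN) and (2 TP, 0 FP, 3 FN)
p, r, f1 = micro_prf([(8, 2, 0), (2, 0, 3)])
print(p, r, f1)  # P = 10/12, R = 10/13, F1 = 0.8
```

Note that micro-averaging weights every PHI mention equally, so documents with many entities contribute more to the final score than sparse ones.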

For both sub-tracks, the official evaluation and the ranking of the submitted systems will be based exclusively on the F-score (F1) measure (labeled as “SubTrack 1 [NER]” and “SubTrack 2 [strict]” in the evaluation script). The other metrics explained below are given only to provide more detailed information about the performance of the systems.

Moreover, there will also be sub-track-specific evaluation metrics. In the case of the first sub-track, the leak score previously proposed for the i2b2 challenges will be computed; it measures leaks (non-redacted PHI remaining after de-identification) as (# false negatives / # sentences present). In the case of the second sub-track, we will additionally compute another evaluation in which we merge PHI spans connected by non-alphanumeric characters. These metrics are not the official metrics of the task.
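The span-merging idea for the second sub-track can be sketched as follows. This is an assumption about the merging rule (two spans are fused when only non-alphanumeric characters lie between them), not the official implementation:

```python
def merge_spans(text, spans):
    """spans: list of (start, end) character offsets into text.
    Fuses consecutive spans whose connecting text contains no
    alphanumeric character (e.g. hyphens, dots, slashes)."""
    merged = []
    for start, end in sorted(spans):
        between = text[merged[-1][1]:start] if merged else ""
        if merged and not any(c.isalnum() for c in between):
            merged[-1] = (merged[-1][0], end)  # fuse with the previous span
        else:
            merged.append((start, end))
    return merged

text = "Tel.: 93-412-7654, Barcelona"
spans = [(6, 8), (9, 12), (13, 17)]  # "93", "412", "7654"
print(merge_spans(text, spans))      # [(6, 17)] -> "93-412-7654"
```

Under this relaxed view, a system that tags a phone number as three separate fragments still receives credit for masking the whole sensitive expression.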

In terms of participating submissions, we will allow up to 5 runs per registered team. Submissions have to be provided in a predefined prediction format (brat or i2b2) and be returned to the track organizers before the test set submission deadline (end of May).

See evaluation examples here.

Submission Format  

A submission for this competition would look similar to the following (this example uses the brat format; you can also submit the xml format):
     |- brat
           |- subtask1
                  |- S0004-06142005000500011-1.ann
                  |- S0004-06142005000500011-1.txt
                  |- ...
           |- subtask2
                  |- S0004-06142005000500011-1.ann
                  |- S0004-06142005000500011-1.txt
                  |- ...


1. The root of the zip file should be the brat or xml directory.

2. Participants must annotate the entire test set; otherwise, the submission will not be processed.
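Packaging the predictions into the required layout can be sketched as follows. The directory and file names follow the example above; the function name is hypothetical:

```python
import os
import zipfile

def build_submission(pred_dir, zip_path):
    """pred_dir must contain brat/subtask1 and brat/subtask2 with one
    .ann (and optionally .txt) file per test document. The archive is
    written so that 'brat' (or 'xml') sits at the root of the zip."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(pred_dir):
            for name in files:
                full = os.path.join(root, name)
                # arcname drops pred_dir so 'brat/...' becomes the zip root
                zf.write(full, arcname=os.path.relpath(full, pred_dir))

# build_submission("predictions", "meddocan_run1.zip")
```

Checking the archive with `zipfile.ZipFile(...).namelist()` before uploading is an easy way to verify rule 1 above.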

Terms and Conditions


In conformity with the Personal Data Protection Normative (General Data Protection Regulation 2016/679), you hereby authorize that the personal data you have provided will be incorporated into Google Drive, located on Google Inc. servers. More information about Google's privacy policy at:

The purpose of this data processing is to contact the participants of the MEDDOCAN task to send them information related to the task.

This data will not be transmitted to third parties and will be preserved for a maximum of 4 years.
In any case, you can revoke your consent at any time, as well as exercise your rights of access, modification, rectification or removal, the limitation of processing or opposition to it, and the right to data portability. All such requests must be submitted to C/ Jordi Girona nº 31, 08034 Barcelona (Spain), or by contacting the Data Officer of the BSC-CNS at the following email address:

You can also submit a claim to the Spanish Data Protection Agency.


Schedule

First phase

Start: April 4, 2019, midnight UTC

Description: Sub-task 1) NER offset and entity type classification. Sub-task 2) Sensitive token detection.

Test set (includes background set)

Start: April 29, 2019, midnight UTC

Description: Sub-task 1) NER offset and entity type classification. Sub-task 2) Sensitive token detection.

Competition ends

May 18, 2019, noon UTC
