The following articles/books have been published using this dataset. Please cite these publications if you use this dataset. If you have a publication that uses this dataset, please let us know; we would like to list it here.
Cell classification via image processing has recently gained interest for building computer-assisted diagnostic tools for blood disorders such as leukemia. To arrive at a conclusive decision on disease diagnosis and degree of progression, it is very important to identify malignant cells with high accuracy. Computer-assisted tools can be very helpful in automating the process of cell segmentation and identification. Identifying malignant cells vis-à-vis normal cells in microscopic images is difficult because the two cell types appear morphologically similar.
As a consequence, leukemia (blood cancer) is detected via microscopic image analysis only in advanced stages, not because the malignant cells can be identified under the microscope, but because of medical domain knowledge: the cancer cells start growing in an unrestricted fashion and are therefore present in much larger numbers than in a normal person.
It is, however, important to diagnose the disease early for better cure and for improving the overall survival of subjects suffering from cancer. Although advanced methods such as flow cytometry are available, they are very expensive and are not widely available in pathology laboratories or hospitals, particularly in rural areas. On the other hand, a computer-based solution can be deployed easily at a much lower cost. It is hypothesized that advanced methods of medical image processing can lead to the identification of normal versus malignant cells and hence can aid in the diagnosis of cancer in a cost-effective manner.
Hence, this is an effort to build an automated classifier that will overcome the problems associated with deploying sophisticated high-end machines with recurring reagent costs. It will also help pathologists and oncologists make quicker, data-driven inferences.
In this challenge, a dataset of cells with labels (normal versus malignant) will be provided to train machine learning based classifiers to distinguish normal cells from leukemic blasts (malignant cells). These cells (as shown above in figure-1) have been segmented from the images after those images were stain normalized. The overall size of the images was 2560x1920, while a single cell image is roughly 300x300 pixels. The images are representative of real-world images because they contain some staining noise and illumination errors, although these errors have largely been fixed by us via our own in-house method of stain normalization [1,2].
The ground truth has been marked by an expert oncologist. The aim of the challenge is to compare and rank the competing methods developed by participants based on performance metrics such as accuracy and F1 score.
This problem is very challenging because, as stated above, the two cell types appear morphologically very similar. The ground truth has been marked by the expert based on domain knowledge. Through our efforts over the past two years, we have also recognized that subject-level variability plays a key role and, as a consequence, it is challenging to build a classifier that can yield good results on prospective data. Anyone deeply interested in working on a challenging problem of medical image classification by building newer deep learning/machine learning architectures would, in our opinion, come forward to work on this challenge.
References and credits:
[1] Anubha Gupta, Rahul Duggal, Ritu Gupta, Lalit Kumar, Nisarg Thakkar, and Devprakash Satpathy, "GCTI-SN: Geometry-Inspired Chemical and Tissue Invariant Stain Normalization of Microscopic Medical Images," under review.
[2] Ritu Gupta, Pramit Mallick, Rahul Duggal, Anubha Gupta, and Ojaswa Sharma, "Stain Color Normalization and Segmentation of Plasma Cells in Microscopic Images as a Prelude to Development of Computer Assisted Automated Disease Diagnostic Tool in Multiple Myeloma," 16th International Myeloma Workshop (IMW), India, March 2017.
Based upon ranking in the leaderboard and the submitted write-up, the ground truth of the preliminary test set will be released to the top 10% of teams/individuals (20 max). This ground truth can be used by the participants for re-training/fine-tuning the model so as to give better performance on the final test set. Finally, at the end of phase 2, participants are required to submit a detailed paper in ISBI format with the following results on the preliminary test set and supplementary files containing training curves and all the relevant results:
The top three scorers on the leaderboard will be invited to attend the ISBI-2019 workshop. These participants will also be required to provide developed code and trained models to the organizing team.
The scikit-learn library is to be used for calculating the weighted precision, weighted recall, and weighted F1 score. Please note that the 'weighted' metrics will be used for performance evaluation.
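As an illustration only, the weighted metrics can be computed with scikit-learn's `precision_score`, `recall_score`, and `f1_score` using `average="weighted"`. The labels below are made-up toy values, and the encoding (1 = malignant, 0 = normal) is an assumption for this sketch, not the challenge's official label format or evaluation script.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy labels (assumed encoding: 1 = malignant, 0 = normal).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# average="weighted" computes each metric per class and then averages
# the per-class values weighted by the number of true samples per class.
weighted_precision = precision_score(y_true, y_pred, average="weighted")
weighted_recall = recall_score(y_true, y_pred, average="weighted")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"weighted precision: {weighted_precision:.4f}")
print(f"weighted recall:    {weighted_recall:.4f}")
print(f"weighted F1:        {weighted_f1:.4f}")
```

Because the per-class scores are weighted by class support, the majority class contributes more to the final number, which matters on an imbalanced dataset such as this one.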
References:
[1] Hammack, Daniel, and Julian de Wit. “Predicting Lung Cancer” blog.kaggle.com. http://blog.kaggle.com/2017/06/29/2017-data-science-bowl-predicting-lung-cancer-2nd-place-solution-write-up-daniel-hammack-and-julian-de-wit/ (Accessed 17th August 2018)
Please cite the following papers if you are using this challenge (or dataset) for any research purpose:
Publication Citation
This is a two-class classification problem. The training and testing data contain a folder-level division of individual subjects. An example directory tree for the data division is shown below:
Train folder:
Test folder:
The dataset contains a total of 118 individual subjects, distributed as follows:
Data Availability Schedule: The data will be released in three phases.
Phase 1: During the first phase, we will release the training set.
Phase 2: During the second phase, we will release the preliminary test set.
Phase 3: During the third and final phase, we will release the final test set.
The final test set will be made available only to participating teams that submitted a detailed paper at the end of phase 2 (refer to the 'Evaluation' section). Participants will be given a limited time window (refer to the 'Important Dates' section) after the release of the test dataset and are required to make the final submission within that time frame.
What general pre-processing steps will be performed?
The data is already preprocessed and does not require any further processing. However, participants are free to apply any further processing techniques, if required.
The dataset is imbalanced, and participants need to take this into account while training the model so that it gives satisfactory performance on the test set. Subject-level variability also plays an important role in model performance on the test set and hence should be addressed during training. For example, training and validation splits should be done at the subject level instead of the image level. If this is not addressed, the model may perform poorly on data from prospective subjects.
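One possible way to follow this advice is sketched below, using scikit-learn's `GroupShuffleSplit` so that all images from a subject land on the same side of the split, and `compute_class_weight` to derive balanced class weights from the training portion. The subject IDs, labels, and the 25% validation fraction are hypothetical toy values chosen for illustration; they do not reflect the actual dataset layout.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy data: one entry per cell image.
# subject_ids says which subject each image came from (made-up IDs).
subject_ids = np.array(["S1", "S1", "S1", "S2", "S2",
                        "S3", "S3", "S3", "S4", "S4"])
labels = np.array([1, 1, 1, 0, 0, 1, 1, 1, 0, 1])  # assumed: 1 = malignant
image_idx = np.arange(len(labels))

# Split by subject, not by image: every image of a given subject
# ends up entirely in the training set or entirely in validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(
    splitter.split(image_idx, labels, groups=subject_ids)
)

train_subjects = set(subject_ids[train_idx])
val_subjects = set(subject_ids[val_idx])
print("train subjects:", sorted(train_subjects))
print("validation subjects:", sorted(val_subjects))

# Balanced class weights computed from the training labels only;
# the rarer class receives the larger weight.
train_classes = np.unique(labels[train_idx])
class_weights = compute_class_weight(
    class_weight="balanced", classes=train_classes, y=labels[train_idx]
)
print("class weights:", dict(zip(train_classes.tolist(), class_weights)))
```

The resulting weights could be passed to a loss function or to an estimator's `class_weight` parameter; whether reweighting, resampling, or another strategy works best here is left to the participant.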
Important Dates

Date | Description |
---|---|
10-Nov, 2018 | Release of the training set |
15-Jan, 2019 | Release of the preliminary test set |
22-Jan, 2019 | Last date for submission of results on the preliminary test set and submission of brief write-up |
23-Jan, 2019 | Release of the ground truth of preliminary test set to top 10% participants (20 max.) |
14-Mar, 2019 | Last date for submission of a detailed paper in ISBI format |
15-Mar, 2019 | Release of the final test set |
17-Mar, 2019 | Last date for submission of results on the final test set |
Top Entries of C-NMC 2019 Challenge

Name | Rank | Weighted F1 Score |
---|---|---|
Yongsheng Pan | 1 | 0.910 |
Ekansh Verma | 2 | 0.894 |
Jonas Prellberg | 3 | 0.889 |
Fenrui Xiao | 4 | 0.885 |
Tian Shi | 5 | 0.879 |
Ying Liu | 6 | 0.876 |
Salman Shah | 8 | 0.866 |
Yifan Ding | 9 | 0.855 |
Xinpeng Xie | 10 | 0.848 |
Start: Nov. 10, 2018, midnight
Description: Training phase: create models using the training set
Start: Jan. 15, 2019, midnight
Description: Preliminary Testing: Release of preliminary test set.
Start: March 15, 2019, midnight
Description: Final Testing: Release of final test set.
# | Username | Score |
---|---|---|
1 | shubham14100 | 0.9526 |
2 | shiv_sbilab | 0.9486 |
3 | yspan | 0.9357 |