Cell classification via image processing has recently gained interest for building computer-assisted diagnostic tools for blood disorders such as leukemia. To arrive at a conclusive decision on disease diagnosis and degree of progression, it is very important to identify malignant cells with high accuracy. Computer-assisted tools can be very helpful in automating cell segmentation and identification. Distinguishing malignant cells from normal cells in microscopic images is difficult because, morphologically, the two cell types appear similar.
As a consequence, leukemia (blood cancer) is typically detected at advanced stages via microscopic image analysis, not because the malignant cells can be identified individually under the microscope, but because of medical domain knowledge: cancer cells grow in an unrestricted fashion and are therefore present in much larger numbers than in a healthy person.
Early diagnosis, however, is important for better treatment outcomes and for improving the overall survival of subjects suffering from cancer. Although advanced methods such as flow cytometry are available, they are very expensive and not widely available in pathology laboratories or hospitals, particularly in rural areas. A computer-based solution, on the other hand, can be deployed easily at a much lower cost. It is hypothesized that advanced methods of medical image processing can lead to the identification of normal versus malignant cells and hence aid in the diagnosis of cancer in a cost-effective manner.
Hence, this is an effort to build an automated classifier that overcomes the problems associated with deploying sophisticated high-end machines and their recurring reagent costs. It will also aid pathologists and oncologists in making quicker, data-driven inferences.
In this challenge, a dataset of labeled cells (normal versus malignant) will be provided to train a machine-learning-based classifier that identifies normal cells from leukemic blasts (malignant cells). These cells have been segmented from the microscopic images after those images have been stain normalized. The overall size of each image is 2560x1920 pixels, while a single-cell image is roughly 300x300 pixels. The images are representative of real-world data because they contain some staining noise and illumination errors, although these errors have largely been fixed via our own in-house stain-normalization method [1,2].
The ground truth has been marked by an expert oncologist. The aim of the challenge is to compare and rank the competing methods developed by participants based on performance metrics such as accuracy and F1 score.
This problem is very challenging because, as stated above, the two cell types appear morphologically very similar, and the ground truth has been marked by an expert based on domain knowledge. Through our efforts over the past two years, we have also recognized that subject-level variability plays a key role; as a consequence, it is challenging to build a classifier that yields good results on prospective data. In our opinion, anyone deeply interested in working on a challenging problem of medical image classification via building newer deep learning/machine learning architectures will want to take up this challenge.
References and credits:
[1] Anubha Gupta, Rahul Duggal, Ritu Gupta, Lalit Kumar, Nisarg Thakkar, and Devprakash Satpathy, “GCTI-SN: Geometry-Inspired Chemical and Tissue Invariant Stain Normalization of Microscopic Medical Images,” under review.
[2] Ritu Gupta, Pramit Mallick, Rahul Duggal, Anubha Gupta, and Ojaswa Sharma, "Stain Color Normalization and Segmentation of Plasma Cells in Microscopic Images as a Prelude to Development of Computer Assisted Automated Disease Diagnostic Tool in Multiple Myeloma," 16th International Myeloma Workshop (IMW), India, March 2017.
This is a two-class classification problem. The training and testing data contain a folder-level division of individual subjects. An example directory tree for the data division is shown below:
The train folder contains three patient-level splits, e.g., Fold1 contains the full data of patient IDs 1, 2, 3, 4, 5, and Fold2 contains the full data of patient IDs 6, 7, 8, 9, 10. No two splits overlap in terms of patient data, i.e., a patient ID found in Fold1 will be present only in Fold1.
[Directory tree example: each fold folder contains the individual cell images, e.g., Image7, Image8, … and Image9, Image10, …]
All image names follow a standard naming convention, described below:
UID_1 -> The 1 at the end signifies the patient ID.
UID_1_2 -> The number 2 represents the image number.
UID_1_2_1 -> The number 1 at the end represents the cell count (more than one cell can be found in a particular microscopic image).
UID_1_2_1_all -> The 'all' tag represents the class to which the cell belongs, in this case, 'ALL' or the cancerous class.
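The naming convention above can be parsed programmatically to recover the subject ID and label for each cell image. A minimal sketch (the function name and returned field names are our own, not part of the challenge; the tag used for the normal class is not specified above, so the label field is returned as-is):

```python
def parse_cell_filename(name):
    """Split a cell-image name like 'UID_1_2_1_all' into its fields.

    Returns a dict with the patient ID, image number, cell count,
    and class tag ('all' denotes the cancerous class).
    """
    stem = name.rsplit(".", 1)[0]   # drop any file extension, if present
    parts = stem.split("_")         # e.g. ['UID', '1', '2', '1', 'all']
    return {
        "patient_id": int(parts[1]),
        "image_number": int(parts[2]),
        "cell_count": int(parts[3]),
        "label": parts[4],
    }
```

Parsing the patient ID this way is what makes the subject-level train/validation splits recommended later in this description straightforward to implement.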
The dataset contains a total of 154 individual subjects/cases, distributed as follows:
Data Availability Schedule: The data will be released in three phases.
Phase 1: During the first phase, we will release the training set. The training set consists of data from 47 ALL subjects and 29 normal subjects, containing a total of 7272 ALL and 3389 normal cell images. The training set will include the ground-truth labels of the cell images.
Phase 2: During the second phase, we will release the preliminary test set. It consists of data from 15 ALL subjects and 15 normal subjects. The ground truth will also be made available to the top-performing participants (refer to the 'Evaluation' section).
Phase 3: During the third and final phase, we will release the final test set. It consists of data from 11 ALL subjects and 8 Normal subjects.
The final test set will be made available only to those participating teams that submitted a detailed paper at the end of Phase 2 (refer to the 'Evaluation' section). Participants will be given a limited time window (refer to the 'Important Dates' section) after the release of the test dataset and are required to make their final submission within that time frame.
What general pre-processing steps will be performed?
The data is already preprocessed and does not require any further processing. However, participants are free to apply additional processing techniques, if required.
The dataset is imbalanced, and participants need to account for this while training the model in order to achieve satisfactory performance on the test set. Subject-level variability will also play an important role in the evaluation of the trained model on the test set, so it is recommended that this aspect be considered during training. For example, training and validation splits may be made at the subject level instead of at the image level. If this is not addressed, the trained model may perform poorly on prospective subjects' data.
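A subject-level split as recommended above can be obtained with scikit-learn's group-aware splitters. A minimal sketch on toy stand-in data (the file names, labels, and subject IDs below are invented for illustration; in a real run the subject IDs would be parsed from the image names):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in data: 8 cell images from 4 subjects.
# Labels: 1 = ALL (malignant), 0 = normal.
image_paths = [f"img{i}.bmp" for i in range(8)]
labels      = [1, 1, 1, 1, 0, 0, 0, 0]
subject_ids = [1, 1, 2, 2, 3, 3, 4, 4]

# Split at the subject level: every image from a given subject lands on
# exactly one side of the split, so the validation score measures
# generalization to unseen subjects, not to unseen crops of seen subjects.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(image_paths, labels,
                                         groups=subject_ids))

train_subjects = {subject_ids[i] for i in train_idx}
val_subjects   = {subject_ids[i] for i in val_idx}
assert train_subjects.isdisjoint(val_subjects)  # no subject overlap
```

`GroupKFold` can be used in the same way when a full subject-level cross-validation is wanted rather than a single hold-out split.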
Based upon the leaderboard ranking and the submitted write-up, the ground truth of the preliminary test set will be released to the top 25 (or more) teams/individuals. This ground truth can be used by the participants for re-training/fine-tuning the model so as to improve performance on the final test set. Finally, at the end of Phase 2, participants are required to submit a detailed paper in ISBI format with the following results on the preliminary test set, along with supplementary files containing training curves and all the relevant results:
The top ten (or more) scorers on the leaderboard will be invited to attend the ISBI-2019 workshop. These participants will also be required to provide developed code and trained models to the organizing team.
The scikit-learn library is to be used for calculating the weighted precision, weighted recall, and weighted F1 score. Please note that 'weighted' metrics will be used for performance evaluation.
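For reference, the weighted metrics can be computed as follows; the predictions below are invented toy values, shown only to illustrate the metric calls:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy predictions on a small imbalanced set (1 = ALL, 0 = normal).
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]

# average='weighted' averages the per-class scores weighted by class
# support, so the majority class contributes proportionally more.
prec = precision_score(y_true, y_pred, average="weighted")
rec  = recall_score(y_true, y_pred, average="weighted")
f1   = f1_score(y_true, y_pred, average="weighted")
# For this toy example each metric works out to 0.75.
```

With an imbalanced dataset such as this one, the weighted average is a deliberate choice: plain accuracy could look good for a classifier that simply favors the majority class.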
 Hammack, Daniel, and Julian de Wit. “Predicting Lung Cancer” blog.kaggle.com. http://blog.kaggle.com/2017/06/29/2017-data-science-bowl-predicting-lung-cancer-2nd-place-solution-write-up-daniel-hammack-and-julian-de-wit/ (Accessed 17th August 2018)
The link for downloading the Phase 1 data will soon be made available to the participants. Participants must register and agree to the data usage terms before downloading the data. We will soon be publishing this data on The Cancer Imaging Archive. A supporting data article will be released in Elsevier Data-in-Brief; the DOI will be updated on the challenge website. All participants are requested to cite this data article in any publication that uses the challenge dataset.
There will be no deadline extension for submissions in phase two and phase three.
The test datasets will be made available only to the qualifying teams of Phase 2 and 3 of this competition and will not be distributed freely.
The following papers need to be cited if the dataset provided in this challenge is used for any publication:
Anubha Gupta, Rahul Duggal, Ritu Gupta, Lalit Kumar, Nisarg Thakkar, and Devprakash Satpathy, “GCTI-SN: Geometry-Inspired Chemical and Tissue Invariant Stain Normalization of Microscopic Medical Images,” under review.
Ritu Gupta, Pramit Mallick, Rahul Duggal, Anubha Gupta, and Ojaswa Sharma, "Stain Color Normalization and Segmentation of Plasma Cells in Microscopic Images as a Prelude to Development of Computer Assisted Automated Disease Diagnostic Tool in Multiple Myeloma," 16th International Myeloma Workshop (IMW), India, March 2017.
We gratefully acknowledge the research funding support (Grant: EMR/2016/006183) from the Department of Science and Technology, Govt. of India for this research work.
| Date | Event |
|---|---|
| 10-Nov-2018 | Release of the training set |
| 15-Jan-2019 | Release of the preliminary test set |
| 22-Jan-2019 | Last date for submission of results on the preliminary test set and of a brief write-up |
| 23-Jan-2019 | Release of the ground truth of the preliminary test set to the top 25 participating teams |
| 14-Mar-2019 | Last date for submission of a detailed paper in ISBI format, with detailed results on the preliminary test set in the supplementary material |
| 15-Mar-2019 | Release of the final test set |
| 17-Mar-2019 | Last date for submission of results on the final test set |
Training phase (create models using the training set): starts Nov. 10, 2018, midnight.
Preliminary Testing (release of the preliminary test set): starts Jan. 15, 2019, midnight.
Final Testing (release of the final test set): starts March 15, 2019, midnight.
Competition ends: March 18, 2019, midnight.