The main question of this challenge is: how can we predict the survival of a patient from his or her medical record? More specifically, you will have to predict whether or not each patient died during their stay at the hospital.
Every day, doctors and nurses collect a lot of information about patients by asking questions and using appropriate tools (stethoscope, syringe, sensors, etc.). This data is very useful for monitoring a patient's health, making diagnoses and choosing treatments.
It can also be used for statistical predictive analysis.
The training dataset contains information about 80,000 patients, represented by categorical, binary and numerical features (also called "variables"). These features include, for instance, age, gender, ethnicity and marital status, as well as medical data such as blood pressure or glucose level. There are 342 variables in total.
The class (or label) to be predicted is a binary variable telling whether or not the patient died while in the hospital. Fortunately, most of them did not:
This leads to an imbalanced classification problem. In the above graphic, 0 means "DIDN'T DIE" and 1 means "DIED".
The task is to create a model able to learn from the data and make predictions on unseen data. This is called supervised learning, because the algorithm learns from labeled data. Each example (each patient) has a label telling whether or not the patient died during the hospital stay. It is possible to learn a link between the data and the labels and use it to make predictions, that is, to automatically label unseen data.
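To make the idea of learning from labeled data concrete, here is a minimal sketch in plain Python. It uses a nearest-centroid classifier, chosen only because it is short; it is not the method of the starting kit. The function names `fit_centroids` and `predict` and the two made-up features per patient are assumptions for illustration and do not come from the challenge data.

```python
def fit_centroids(X, y):
    """Learn one mean feature vector (centroid) per class from labeled rows."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        total = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, X):
    """Label each row with the class whose centroid is closest (squared Euclidean distance)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda c: dist(row, centroids[c])) for row in X]


# Hypothetical toy data: two invented numeric features per patient.
X_train = [[30, 90], [45, 100], [80, 180], [75, 200]]
y_train = [0, 0, 1, 1]  # 0 = "DIDN'T DIE", 1 = "DIED"

model = fit_centroids(X_train, y_train)  # learn from labeled data
X_test = [[35, 95], [78, 190]]           # unseen, unlabeled rows
preds = predict(model, X_test)           # automatically label them
print(preds)
```

The same two steps (fit on labeled rows, then predict on unlabeled rows) apply to any supervised model you choose for the challenge.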
Isabelle Guyon, Kristin Bennett, Andrew Yale, Adrien Pavao, Thomas Gerspacher.
The problem is a binary classification problem. This means you have to label every row of the test set with a 0 or a 1, corresponding to the two classes "DIDN'T DIE" and "DIED" respectively.
The submissions are evaluated using the balanced accuracy metric (the average of the recall obtained on each class).
The accuracy score is the fraction of correctly classified examples. As the class distribution (the number of examples of each class) is imbalanced, using the accuracy score wouldn't be fair: a model classifying every patient as "DIDN'T DIE" would get 94% accuracy while being clearly useless as a predictive model. This is why we use balanced accuracy: an accuracy score that takes the class distribution into account.
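The difference between the two scores can be checked with a short sketch. The code below is plain Python written for illustration, not the challenge's scoring program. The toy labels (94 survivors and 6 deaths out of 100 patients) are made up to match the 94% figure above.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of the examples of class `cls` that were predicted as `cls`."""
    preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(1 for p in preds_for_cls if p == cls) / len(preds_for_cls)


def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recalls over the classes present in y_true."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy labels: 94 patients who survived (0) and 6 who died (1).
y_true = [0] * 94 + [1] * 6
# A useless model that predicts "DIDN'T DIE" for everyone.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)           # 94 correct out of 100
bal = balanced_accuracy(y_true, y_pred)  # (recall on 0 + recall on 1) / 2 = (1.0 + 0.0) / 2
print(acc, bal)
```

The constant model scores 0.94 in accuracy but only 0.5 in balanced accuracy, the same as random guessing. scikit-learn provides this metric as `sklearn.metrics.balanced_accuracy_score` if you prefer not to write it yourself.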
You may make up to 10 submissions per day and 1000 in total.
This challenge is for educational purposes only; no prizes are awarded.
This challenge is governed by the general ChaLearn contest rules.
Download | Size (MB) | Phase |
---|---|---|
Starting Kit | 0.007 | #1 Final |
Public Data | 27.225 | #1 Final |
Start: June 15, 2018, midnight
End: Never
# | Username | Score |
---|---|---|
1 | zhuq2 | 0.77 |
2 | imriea | 0.77 |
3 | wangz45 | 0.76 |