AutoML Round 1 :: ICML hackathon

Organized by lukasz.romaszko


Competition start: Feb. 20, 2015, 11 p.m. UTC
Competition end: July 11, 2015, 9:05 p.m. UTC

ICML AutoML hackathon, July 11, Lille

This is a clone of the AutoML challenge website, created for the ICML 2015 AutoML workshop.


Round 0 of the AutoML challenge is available here.

About the AutoML Challenge:
This is a "supervised learning" challenge in machine learning. We are making available 30 datasets, all pre-formatted in given feature representations (this means that each example consists of a fixed number of numerical coefficients). The challenge is to solve classification and regression problems, without any further human intervention.

See the AutoML website for the full challenge.


This challenge is brought to you by ChaLearn. Contact the organizers.



This challenge is concerned with regression and classification problems (binary, multi-class, or multi-label) from data already formatted in fixed-length feature-vector representations. Each task is associated with a dataset coming from a real application. The domains of application are very diverse and are drawn from: biology and medicine, ecology, energy and sustainability management, image, text, audio, speech, video and other sensor data processing, internet social media management and advertising, market analysis and financial prediction.
All datasets present themselves in the form of data matrices with samples in lines and features (or variables) in columns. For instance, in a medical application, the samples may represent patient records and the features may represent results of laboratory analyses. The goal is to predict a target value, for instance the diagnosis "diseased" or "healthy" in the case of a medical diagnosis problem.
The identity of the datasets and the features is concealed (except in round 0) to avoid the use of domain knowledge and push the participants to design fully automated machine learning solutions.
In addition, the tasks are constrained by:

  • A Time Budget.
  • A Scoring Metric.

Task, scoring metric and time budget are provided with the data, in a special "info" file.
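The exact syntax of the "info" file is documented in the Starting Kit. As an illustration only (not the official parser), assuming a simple one-entry-per-line "key = value" layout, such a file could be read like this:

```python
# Hypothetical parser for a dataset "info" file.
# Assumed layout (illustrative), one "key = value" pair per line, e.g.:
#   task = binary.classification
#   metric = auc_metric
#   time_budget = 300
def parse_info(path):
    info = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blanks and malformed lines
            key, _, value = line.partition("=")
            info[key.strip()] = value.strip().strip("'")
    return info
```

The returned dictionary then tells your code which task it faces, which metric it will be judged on, and how many seconds it has.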

Time Budget

The Codalab platform provides computational resources shared by all participants. To ensure the fairness of the evaluation, when a code submission is evaluated, its execution time is limited to a given Time Budget, which varies from dataset to dataset. The time budget is provided with each dataset in its "info" file. The organizers reserve the right to adjust the time budget by supplying the participants with new info files.
The participants who submit results (instead of code) are NOT constrained by the Time Budget, since they can run their code on their own platform. This may be advantageous for entries counting towards the Final phases (immediately following a Tweakathon). The participants wishing to also enter the AutoML phases, which require submitting code, can submit BOTH results and code (simultaneously). See the Instructions for details.

Scoring Metrics

The scoring program computes a score by comparing submitted predictions with reference "target values". For each sample i, i=1:P, the target value is:

  • a continuous numeric coefficient y_i, for regression problems;
  • a vector of binary indicators [y_ik] in {0, 1}, for multi-class or multi-label classification problems (one per class k);
  • a single binary indicator y_i in {0, 1}, for binary classification problems.

The participants must turn in prediction values matching as closely as possible the target value, in the form of:

  • a continuous numeric coefficient q_i, for regression problems;
  • a vector of numeric coefficients [q_ik] in the range [0, 1], for multi-class or multi-label classification problems (one per class k);
  • a single numeric coefficient q_i in the range [0, 1], for binary classification problems.

The Starting Kit contains the Python implementation of all scoring metrics used to evaluate the entries. Each dataset has its own metric (scoring criterion), specified in its "info" file. All scores are re-normalized such that the expected value of the score for a "trivial guess" based on class prior probabilities is 0 and the optimal score is 1. Multi-label problems are treated as multiple binary classification problems and are evaluated by the average of the scores of each binary classification sub-problem.
The scores are taken from the following list:

  • R2: R-square or "coefficient of determination", used for regression problems: R2 = 1 - MSE/VAR, where MSE = <(y_i - q_i)^2> is the mean-square error and VAR = <(y_i - m)^2> is the variance, with m = <y_i>.
  • ABS: A coefficient similar to R2 but based on the mean absolute error (MAE) and mean absolute deviation (MAD): ABS = 1 - MAE/MAD, with MAE = <|y_i - q_i|> and MAD = <|y_i - m|>.
  • BAC: Balanced accuracy, the average of the class-wise accuracies for classification problems (or the average of sensitivity (true positive rate) and specificity (true negative rate) in the special case of binary classification). For binary classification problems, the class-wise accuracy is the fraction of correct class predictions when q_i is thresholded at 0.5, for each class. The class-wise accuracies are averaged over all classes for multi-label problems. For multi-class classification problems, the predictions are binarized by selecting the class with maximum prediction value, argmax_k q_ik, before computing the class-wise accuracy. We normalize the BAC with the formula BAC := (BAC - R)/(1 - R), where R is the expected value of BAC for random predictions (i.e. R = 0.5 for binary classification and R = 1/C for C-class classification problems).
  • AUC: Area under the ROC curve, used for ranking and for binary classification problems. The ROC curve plots sensitivity vs. 1 - specificity as a threshold on the predictions is varied. The AUC is identical to the BAC for binary predictions. The AUC is calculated for each class separately before averaging over all classes. We normalize it with the formula AUC := 2 AUC - 1, making it de facto identical to the so-called Gini index.
  • F1 score: The harmonic mean of precision and recall, where precision = positive predictive value = true_positives / all_predicted_positive and recall = sensitivity = true positive rate = true_positives / all_real_positive. Prediction thresholding and class averaging are handled as for the BAC. We also normalize F1 with F1 := (F1 - R)/(1 - R), where R is the expected value of F1 for random predictions (i.e. R = 0.5 for binary classification and R = 1/C for C-class classification problems).
  • PAC: Probabilistic accuracy, PAC = exp(-CE), based on the cross-entropy or log loss: CE = -<sum_k y_ik log(q_ik)> for multi-class classification and CE = -<y_i log(q_i) + (1 - y_i) log(1 - q_i)> for binary classification and multi-label problems. Class averaging is performed after taking the exponential in the multi-label case. We normalize with PAC := (PAC - R)/(1 - R), where R is the score obtained using q_i = <y_i> or q_ik = <y_ik> (i.e. using as predictions the fraction of positive-class examples, an estimate of the prior probability).

We note that for R2, ABS, and PAC the normalization uses a "trivial guess" corresponding to the average target value, q_i = <y_i> or q_ik = <y_ik>. In contrast, for BAC, AUC, and F1 the "trivial guess" is a random prediction of one of the classes with uniform probability.
In all formulas the brackets <.> designate the average over all P samples indexed by i: <y_i> = (1/P) sum_i y_i. Only R2 and ABS make sense for regression; we compute the other scores for completeness by replacing the target values with binary values obtained by thresholding them at mid-range.
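The official implementations of all metrics ship with the Starting Kit. Purely as an illustration of the normalization idea, the R2 score and the normalized binary BAC could be computed along these lines:

```python
import numpy as np

def r2_score(y, q):
    """R2 = 1 - MSE/VAR: 0 for predicting the mean m, 1 for perfect predictions."""
    mse = np.mean((y - q) ** 2)
    var = np.mean((y - np.mean(y)) ** 2)
    return 1.0 - mse / var

def normalized_bac(y, q, threshold=0.5):
    """Binary balanced accuracy, rescaled so that random guessing scores 0
    and perfect prediction scores 1."""
    pred = (q >= threshold).astype(int)
    sensitivity = np.mean(pred[y == 1] == 1)  # true positive rate
    specificity = np.mean(pred[y == 0] == 0)  # true negative rate
    bac = 0.5 * (sensitivity + specificity)
    r = 0.5  # expected BAC of random predictions in the binary case
    return (bac - r) / (1.0 - r)
```

Note how predicting the constant mean yields R2 = 0, and a BAC of 0.5 (coin flipping) maps to a normalized score of 0, matching the convention that a trivial guess scores 0 and the optimum scores 1.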

Leaderboard score calculation

Each round includes five datasets from different application domains, spanning various levels of difficulty. The participants (or their submitted programs) provide prediction results for the withheld target values (called "solution"), for all 5 datasets. Independently of any intervention of the participants, the original version of the scoring program supplied by the organizers is run on the server to compute the scores. For each dataset, the participants are ranked in decreasing order of performance for the prescribed scoring metric associated with the given task. The overall score is computed by averaging the ranks over all 5 datasets and shown in the column <rank> on the leaderboard.
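The average-rank aggregation described above can be sketched as follows (illustrative code, not the server's scoring program; tie handling in the official implementation may differ):

```python
def leaderboard_ranks(scores):
    """scores: dict mapping participant -> list of per-dataset scores
    (higher = better). Returns each participant's average rank over the
    datasets (lower = better), as shown in the <rank> column."""
    participants = list(scores)
    n_datasets = len(next(iter(scores.values())))
    total = {p: 0.0 for p in participants}
    for d in range(n_datasets):
        # Rank participants on dataset d in decreasing order of score.
        ordered = sorted(participants, key=lambda p: -scores[p][d])
        for rank, p in enumerate(ordered, start=1):
            total[p] += rank
    return {p: total[p] / n_datasets for p in participants}
```

A participant who is first on every dataset gets an average rank of 1.0, so winning overall rewards consistency across all five tasks rather than excellence on a single one.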

We ask the participants to test their systems regularly while training to produce intermediate prediction results, which will allow us to make learning curves (performance as a function of training time). Using such learning curves, we will adjust the "time budget" in subsequent rounds (possibly giving you more computational time!). But only the last point (corresponding to the file with the largest order number) is used for leaderboard calculations.
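An "any-time" training loop that periodically snapshots predictions fits this setup naturally. The sketch below is illustrative (the `train_step` and `predict` callables stand in for your own model code; the Starting Kit shows the real interface):

```python
import time

def anytime_training(train_step, predict, time_budget, checkpoint_every=60):
    """Run train_step() repeatedly within time_budget seconds, collecting a
    prediction snapshot roughly every checkpoint_every seconds.
    Returns the list of snapshots; only the last one counts for the
    leaderboard, the earlier ones feed the organizers' learning curves."""
    snapshots = []
    start = time.time()
    next_checkpoint = start  # snapshot immediately on the first iteration
    while time.time() - start < time_budget:
        train_step()
        if time.time() >= next_checkpoint:
            snapshots.append(predict())
            next_checkpoint = time.time() + checkpoint_every
    snapshots.append(predict())  # final predictions, used for scoring
    return snapshots
```

Each snapshot would be written out as a numbered ".predict" file (see the result submission format below for the naming scheme).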

The results of the LAST submission made are used to compute the leaderboard results (so you must re-submit an older entry that you prefer if you want it to count as your final entry). This is what is meant by “Leaderboard modifying disallowed”. In phases marked with a [+], the participants with the three smallest <rank> are eligible for prizes, if they meet the Terms and Conditions.

Training, validation and test sets

For each dataset, a labeled training set is provided for training and two unlabeled sets (validation set and test set) are provided for testing.

Phases and rounds

The challenge is run in multiple Phases grouped in rounds, alternating AutoML contests and Tweakathons. There are six rounds: Round 0 (Preparation round), followed by 5 rounds of progressive difficulty (Novice, Intermediate, Advanced, Expert, and Master). Except for round 0 (preparation) and round 5 (termination), all rounds include 3 phases, alternating Tweakathons and AutoML contests:

Phase in round [n] | Goal | Duration | Submissions | Data | Leaderboard scores | Prizes
[+] AutoML[n] | Blind test of code | Short | NONE (code migrated) | New datasets, not downloadable | Test set results | Yes
Tweakathon[n] | Manual tweaking | 1 month | Code and/or results | Datasets downloadable | Validation set results | No
[+] Final[n] | Results of Tweakathon revealed | Short | NONE (results migrated) | NA | Test set results | Yes

The results of the last submission made are shown on the leaderboard. Submissions are made in Tweakathon phases only. The last submission of one phase migrates automatically to the next one. If code is submitted, this makes it possible to participate in subsequent phases without making new submissions. Prizes are attributed for phases marked with a [+], during which there is NO submission. The total prize pool is $30,000 (see Rewards and Terms and Conditions for details).

Code vs. result submission

To participate in the AutoML[n] phase, code must be submitted in Tweakathon[n-1]. To participate in Final[n], code or results must be submitted in Tweakathon[n]. If both code and (well-formatted) results are submitted in Tweakathon[n], the results are used for scoring in Tweakathon[n] and Final[n] rather than re-running the code; the code is executed only when results are unavailable or not well formatted. Hence there is no disadvantage to submitting both results and code. There is no obligation that the submitted code be the code that produced the submitted results. Using such mixed submissions of results and code, different methods can be used to enter the Tweakathon/Final phases and the AutoML phases.


There are 5 datasets in each round spanning a range of difficulties:

  • Different tasks: regression, binary classification, multi-class classification, multi-label classification.
  • Class balance: Balanced or unbalanced class proportions.
  • Sparsity: Full matrices or sparse matrices.
  • Missing values: Presence or absence of missing values.
  • Categorical variables: Presence or absence of categorical variables.
  • Irrelevant variables: Presence or absence of additional irrelevant variables (distractors).
  • Number Ptr of training examples: Small or large number of training examples.
  • Number N of variables/features: Small or large number of variables.
  • Aspect ratio Ptr/N of the training data matrix: Ptr>>N, Ptr~=N or Ptr<<N.

We will progressively introduce difficulties from round to round (each round cumulates all the difficulties of the previous ones and adds new ones). Some datasets may be recycled from previous challenges, but reformatted into new representations, except for the final MASTER round, which includes only completely new data.

  1. NOVICE: Binary classification problems only; no missing data; no categorical variables; moderate number of features (<2000); balanced classes; BUT sparse and full matrices; presence of irrelevant variables; various Ptr/N.
  2. INTERMEDIATE: Multi-class and binary classification problems + additional difficulties including: unbalanced classes; small and large number of classes (several hundred); some missing values; some categorical variables; up to 5000 features.
  3. ADVANCED: All types of classification problems, including multi-label + additional difficulties including: up to 300,000 features.
  4. EXPERT: Classification and regression problems, all difficulties.
  5. MASTER: Classification and regression problems, all difficulties, completely new datasets.




Challenge Rules

  • General Terms: This challenge is governed by the General ChaLearn Contest Rule Terms, the Codalab Terms and Conditions, and the specific rules set forth below.
  • Announcements: To receive announcements and be informed of any change in rules, the participants must provide a valid email.
  • Conditions of participation: Participation requires complying with the rules of the challenge. Prize eligibility is restricted by US government export regulations, see the General ChaLearn Contest Rule Terms. The organizers, sponsors, their students, close family members (parents, sibling, spouse or children) and household members, as well as any person having had access to the truth values or to any information about the data or the challenge design giving him (or her) an unfair advantage, are excluded from participation. A disqualified person may submit one or several entries in the challenge and request to have them evaluated, provided that they notify the organizers of their conflict of interest. If a disqualified person submits an entry, this entry will not be part of the final ranking and does not qualify for prizes. The participants should be aware that ChaLearn and the organizers reserve the right to evaluate for scientific purposes any entry made in the challenge, whether or not it qualifies for prizes.
  • Dissemination: The participants will be invited to attend a workshop organized in conjunction with a major machine learning conference and contribute to the proceedings. The challenge is part of the competition program of the IJCNN 2015 conference.
  • Registration: The participants must register to Codalab and provide a valid email address. Teams must register only once and provide a group email, which is forwarded to all team members. Teams or solo participants registering multiple times to gain an advantage in the competition may be disqualified.
  • Anonymity: The participants who do not present their results at the workshop can elect to remain anonymous by using a pseudonym. Their results will be published on the leaderboard under that pseudonym, and their real name will remain confidential. However, the participants must disclose their real identity to the organizers to claim any prize they might win. See our privacy policy for details.
  • Submission method: The results must be submitted through this CodaLab competition site. The participants can make up to 5 submissions per day in the Tweakathon phases. Using multiple accounts to increase the number of submissions is NOT permitted. There are NO submissions in the Final and AutoML phases (the submissions from the previous Tweakathon phase migrate automatically). In case of problems, contact the organizers. The entries must be formatted as specified on the Evaluation page.
  • Awards: There are no awards for this event. The goal is to learn about AutoML and form teams to enter the AutoML challenge.





The datasets are downloadable from the Dataset page.

Code or result submission

The participants must submit a zip file with their code and/or results via the Submission page. To get started in minutes, download the Starting Kit, which includes sample submissions and step-by-step instructions.

Participation does not require submitting code, but, if you submit code for evaluation in a given AutoML phase, it must be submitted during the Tweakathon of the PREVIOUS round. ONLY TWEAKATHON PHASES TAKE SUBMISSIONS. Phases marked with a [+] report results on submissions that are forwarded automatically from the previous phase.

The sample submission can be used to submit results, code, or both:

  • Result submission: To submit prediction results, you must run your code on your own machine. You will need first to download the Datasets and the Starting Kit. Always submit both validation and test set results simultaneously, to be ranked on the leaderboard during the "Tweakathon" phase (using the validation set) and during the "Final" phase (using the test set). Result submissions will NOT allow you to participate in the "AutoML" phase.
  • Code submission: We presently support submission of Python code. An example is given in the Starting Kit. If you want to make entries in other languages, please contact us. In principle, the Codalab platform can accept submissions of any Linux executable, but this has not been tested yet. If you submit code, make sure it produces results on both validation and test data. It will be used for training and testing in all subsequent phases and rounds until you submit new code.
  • Result and code submission: If you submit both results and code, your results will be used for the Tweakathon and Final phases of the present round; your code will be used for the next AutoML phase (and all subsequent phases and rounds), unless you submit new code.

There is no disadvantage to submitting both results and code. The results do not need to have been produced by the code you submit. For instance, you can submit the sample code together with your results if you do not want to submit your own code. You can submit results of models manually tweaked during the Tweakathon phases.

Input format and computational restrictions

The input format is specified on the Dataset page. It includes the prescribed "time budget" for each task (in seconds), which is different for each dataset. In round 0, the total time allowed for all tasks is about half an hour, so BE PATIENT: this is how long the sample code we provide will take to run when you submit it. Submissions of results are processed much faster, in a few minutes.

Result submission format

A sample result submission is provided with the Starting Kit. All result files should be formatted as text files ending with a ".predict" extension, with one result per sample per line, in the order of the samples:

  • Regression problems: one numeric value per line.
  • Binary classification problems: one numeric value between 0 and 1 per line, indicating a score of class 1 membership (1 is certainty of class 1, 0.5 is a random guess, 0 is certainty of class 0).
  • Multiclass or multilabel problems: for C classes, C numeric values between 0 and 1 per line, indicating the scores of membership of the C classes. The scores add up to 1 for multiclass problems only.

We ask the participants to test their models regularly and produce intermediate prediction results, numbered from num = 0 to n. The following file naming convention should be respected:

basename_setname_num.predict

where "basename" is the dataset name (e.g. adult, cadata, digits, dorothea, or newsgroups in the first round), "setname" is either "valid" (validation set) or "test" (test set), and "num" is the order number of the prediction results submitted. Please use a three-digit format (e.g. %03d) for "num", because we sort the file names in alphabetical order to determine the result order.

For example, in the first round, you would bundle for submission the following files in a zip archive (no directory structure):

  • adult_valid_000.predict
  • adult_valid_001.predict
  • adult_valid_002.predict
  • ...
  • adult_test_000.predict
  • adult_test_001.predict
  • adult_test_002.predict
  • ...
  • cadata_valid_000.predict
  • cadata_valid_001.predict
  • cadata_valid_002.predict
  • ...
  • cadata_test_000.predict
  • cadata_test_001.predict
  • cadata_test_002.predict
  • ...
  • etc.

The last result file for each set (with largest number num) is used for scoring. It is useful however to provide intermediate results: ALL the results are used by the organizers to make learning curves and infer whether performance improvements could be gained by increasing the time budget. This will affect the time budget allotted in subsequent rounds.
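A submission archive following this naming scheme could be assembled as sketched below (the helper names are illustrative, and the prediction values here are placeholders for your model's actual outputs):

```python
import zipfile

def prediction_filename(basename, setname, num):
    """Build a file name such as 'adult_valid_000.predict' (three-digit num)."""
    return "%s_%s_%03d.predict" % (basename, setname, num)

def write_submission(zip_path, predictions):
    """predictions: dict mapping (basename, setname, num) -> list of
    per-sample prediction values, in the order of the samples.
    Writes one '.predict' file per entry into a flat zip archive
    (no directory structure, as required)."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for (basename, setname, num), values in predictions.items():
            lines = "\n".join(str(v) for v in values) + "\n"
            zf.writestr(prediction_filename(basename, setname, num), lines)
```

Because the files are sorted alphabetically to determine result order, the zero-padded three-digit numbering matters: "..._010.predict" must sort after "..._009.predict".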




Please subscribe to our Google group to post messages on the forum.




Where can I download the data?

From the Data page, under the Participate tab. You first need to register to have access to it.

How do I make submissions?

Register and go to the Participate tab, where you will find the data and a submission form.

Do you provide tips on how to get started?

We provide a Starting Kit, see Step-by-step instructions.

Are there prizes?

No. There are no awards for this event; the goal is to learn about AutoML and form teams to enter the AutoML challenge.

Do I need to submit code to participate?

No. This hackathon is with result submission only.

Can I give an arbitrary hard time to the organizers?


Where can I get additional help?

For questions of general interest, the participants may subscribe to our Google group and post messages on the forum.




The organization of this challenge would not have been possible without the help of many people who are gratefully acknowledged.


Any opinions, findings, and conclusions or recommendations expressed in material found on this website are those of their respective authors and do not necessarily reflect the views of the sponsors. The support of the sponsors does not give them any particular right to the software and findings of the participants.


Microsoft supported the organization of this challenge and donated the prizes.


This challenge is part of the official selection of IJCNN competitions.


This project received additional support from the Laboratoire d'Informatique Fondamentale (LIF, UMR CNRS 7279) of the University of Aix-Marseille, France, via the LabEx Archimède program. Computing resources were generously provided by Joachim Buhmann, ETH Zürich.


Isabelle Guyon, ChaLearn, Berkeley, California, USA
Evelyne Viegas, Microsoft Research, Redmond, Washington, USA

Data providers:

We selected the 30 datasets used in the challenge among 72 datasets that were donated, or formatted from publicly available data, by:
Yindalon Aphinyanaphongs, New York University, New York, USA
Olivier Chapelle, Criteo, California, USA
Hugo Jair Escalante, INAOE, Puebla, Mexico
Sergio Escalera, University of Barcelona, Catalonia, Spain
Isabelle Guyon, ChaLearn, Berkeley, California, USA
Zainab Iftikhar Malhi, University of Lahore, Pakistan
Vincent Lemaire, Orange research, Lannion, Brittany, France
Chih Jen Lin, National Taiwan University, Taiwan
Meysam Madani, University of Barcelona, Catalonia, Spain
Bisakha Ray, New York University, New York, USA
Mehreen Saeed, University of Lahore, Pakistan
Alexander Statnikov, American Express, New York, USA
Gustavo Stolovitzky, IBM Computational Biology Center, Yorktown Heights, New York, USA
Hans-Jürgen Thiesen, Universität Rostock, Germany
Ioannis Tsamardinos, University of Crete, Greece

Committee members, advisors and beta testers:

Kristin Bennett, RPI, New York, USA
Richard Caruana, Microsoft Research, Redmond, Washington, USA
Igor Chikalov, Intel, USA
Gideon Dror, Yahoo!, Haifa, Israel
Hugo Jair Escalante, INAOE, Puebla, Mexico
Sergio Escalera, University of Barcelona, Catalonia, Spain
Tin Kam Ho, IBM Research, Yorktown Heights, New York, USA
Frank Hutter, Freiburg University, Germany
Hugo Larochelle, Université de Sherbrooke, Canada
Vincent Lemaire, Orange research, Lannion, Brittany, France
Chih Jen Lin, National Taiwan University, Taiwan
Víctor Ponce López, University of Barcelona, Catalonia, Spain
Nuria Macia, Universitat Ramon Llull, Barcelona, Spain
Simon Mercer, Microsoft, Redmond, Washington, USA
Florin Popescu, Fraunhofer First, Berlin, Germany
Mehreen Saeed, University of Lahore, Pakistan
Danny Silver, Acadia University, Wolfville, Nova Scotia, Canada
Alexander Statnikov, American Express, New York, USA
Ioannis Tsamardinos, University of Crete, Greece
Eugene Tuv, Intel, USA

Codalab and other software development

Eric Camichael, Tivix, San Francisco, California, USA
Isabelle Guyon, ChaLearn, Berkeley, California, USA
Ivan Judson, Microsoft, Redmond, Washington, USA
Christophe Poulain, Microsoft Research, Redmond, Washington, USA
Percy Liang, Stanford University, Palo Alto, California, USA
Arthur Pesah, Lycée Henri IV, Paris, France
Xavier Baro Sole, University of Barcelona, Barcelona, Spain
Erick Watson, Sabthok International, Redmond, Washington, USA
Michael Zyskowski, Microsoft Research, Redmond, Washington, USA





Start: Feb. 20, 2015, 11 p.m.

Description: Round 1 :: public leaderboard. Max 20 submissions. Time limit: 10 minutes (submit results).


Start: July 11, 2015, 9:05 p.m.

Description: Results on test data of phase 1. There is NO NEW SUBMISSION. The results on test data of the last submission are shown.

Competition Ends

