Feature Selection Over Urban Growth Prediction Challenge

Organized by abeona

Previous

Development Phase
Feb. 1, 2021, midnight UTC

Current

Final Phase
April 30, 2022, midnight UTC

End

Competition Ends
Never

Urban Growth Prediction

 


“Explosive growth of cities globally signifies the demographic transition from rural to urban, and is associated with shifts from an agriculture-based economy to mass industry, technology, and service. In principle, cities offer a more favorable setting for the resolution of social and environmental problems than rural areas. Cities generate jobs and income, and deliver education, health care and other services. Cities also present opportunities for social mobilization and women's empowerment.” - the World Bank.

Given the importance of urban growth, this project aims to use the power of crowdsourcing to develop a data-based model of future urban growth as a function of socio-economic indicators. The goal is to find a minimal set of indicators that explains, to a reasonable degree, the population growth in these areas. To that end, the challenge asks how best to model this growth using the fewest features while keeping good accuracy. The winning model can help governments create better policies to increase well-being and regulate urban growth, avoiding the problems brought by either extreme: stagnation and overgrowth. The model can equally be applied by businesses. Real estate agencies, for example, can use these results to better plan their resources for the upcoming year.

The novelty of this challenge lies in using crowdsourcing for feature selection (an NP-hard problem) on geographical, social and economic indicators for urban growth modelling. Human insight is very important to this problem because of the shape of the available data: the dataset has many more features than examples, and the number of possible feature subsets is far greater than the largest estimate of the total number of atoms in the universe.

Your challenge is to predict the urban growth of a country for a given year from a large amount of data about that country's previous year.

 

Our team

Mykola LIASHUHA, Guilherme SALES SANTA CRUZ, Louis LAMALLE, Mohamed Salem MESSOUD, Ousmane Cissé, Romain JAMINET

 

Contact

 mykola.liashuha@gmail.com

 

Data

The data comes from the World Bank and contains the main socio-economic indicators of the countries. Link to the World Bank website: here.

 

For submission, only a code_submission without a saved model is accepted. The other submission options (code_submissions with a saved model and result_submissions) will not be taken into account during the final evaluation.

Information about the probe features your model used is shown in the "View scoring output log" section of the submission.

 

This research was [partially] supported by Labex DigiCosme (project ANR11LABEX0045DIGICOSME) operated by ANR as part of the program Investissement d'Avenir Idex ParisSaclay (ANR11IDEX000302).

Evaluation

The criteria for the task are the smallest number of features used by the model and the highest accuracy. The minimum accuracy measured by feature_selection_metric is 33.

The metric used in the scoring program and the IPython notebook is our own metric, called “feature_selection_metric”. It combines two criteria: performance and a small number of features. The score is computed so that the two criteria can be traded against each other: an improvement of 10% in performance is equivalent to removing two features from the input space.
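The trade-off can be checked numerically. The sketch below assumes the logarithmic form of the final evaluation formula given further down this page (log base 0.99 of 1 − R², minus 7 points per feature) and reads "an improvement of 10% in performance" as a 10% reduction of 1 − R². That reading, and every name in the code, is an assumption made for illustration.

```python
import math

LOG_BASE = 0.99          # base of the logarithm in the score formula
POINTS_PER_FEATURE = 7   # score cost of each feature used


def gain_from_error_ratio(ratio):
    """Score gained when 1 - R2 is multiplied by `ratio` (0 < ratio <= 1).

    Because log(a * b) = log(a) + log(b), the gain does not depend on
    the starting value of 1 - R2, only on the ratio.
    """
    return math.log(ratio, LOG_BASE)


# Score gained by cutting 1 - R2 by 10% (assumed reading of "10% improvement").
gain_10_percent = gain_from_error_ratio(0.9)

# The same gain expressed as a number of removed features.
features_equivalent = gain_10_percent / POINTS_PER_FEATURE

# Ratio by which 1 - R2 must shrink to be worth exactly two features.
ratio_worth_two_features = LOG_BASE ** (2 * POINTS_PER_FEATURE)
```

Under these assumptions, a 10% reduction of 1 − R² is worth about 10.5 points, or roughly one and a half features, so the stated equivalence with two features is best read as an approximation.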

To get the best score, aim for the fewest features with the best performance.

This graph illustrates that:

 

 

$\log_{0.99}(\text{performance}) - 7 \times \#\text{features}$, where performance is measured with the $R^2$ metric (coefficient of determination), which quantifies the accuracy of regression predictions. The final formula below writes the argument of the logarithm as $1 - R^2$.

In addition, some garbage (probe) features were inserted in the dataset to assess your selection method. The final score is the original score multiplied by the fraction of good (non-probe) features among the features you selected:

$p = \frac{\text{used\_features} - \text{used\_probes\_features}}{\text{used\_features}}$

So the final evaluation formula is:

$\text{score} = p \times \left(\log_{0.99}(1 - R^2) - 7 \times \#\text{features}\right)$
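As a rough illustration, the final formula can be written as a small Python function. This is a sketch based only on the formula as stated on this page, not the organizers' scoring program. The function name `feature_selection_score`, its argument names, and the input checks are assumptions.

```python
import math


def feature_selection_score(r2, n_features, n_probes):
    """Sketch of: score = p * (log_0.99(1 - R2) - 7 * n_features).

    r2         -- R2 of the regression predictions (must be < 1)
    n_features -- number of features used by the model
    n_probes   -- how many of those features are probe (garbage) features
    """
    if not r2 < 1.0:
        raise ValueError("log_0.99(1 - R2) is undefined for R2 >= 1")
    if n_features <= 0:
        raise ValueError("at least one feature must be used")
    if not 0 <= n_probes <= n_features:
        raise ValueError("n_probes must be between 0 and n_features")

    # Fraction of good (non-probe) features among the features used.
    p = (n_features - n_probes) / n_features

    # Raw score: logarithmic reward for a small 1 - R2, minus 7 per feature.
    raw_score = math.log(1.0 - r2, 0.99) - 7 * n_features

    return p * raw_score


# Same R2 and feature count, with and without a probe among the features.
clean = feature_selection_score(r2=0.9, n_features=3, n_probes=0)
with_probe = feature_selection_score(r2=0.9, n_features=3, n_probes=1)
```

In this sketch, selecting one probe among three features scales the raw score by 2/3, which shows how the probe penalty multiplies the score and is not simply subtracted from it.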

 

You are given a training data matrix X_train of dimension 15290 x 29884 and an array y_train of labels of dimension num_training_samples. You must train a model that predicts the labels for two test matrices, X_valid and X_test.
There are 2 phases:

  • Phase 1: development phase. We provide you with labeled training data and unlabeled validation and test data. Make predictions for both datasets. However, you will receive feedback on your performance on the validation set only. The performance of your LAST submission will be displayed on the leaderboard.
  • Phase 2: final phase. You do not need to do anything. Your last submission of phase 1 will be automatically forwarded. Your performance on the test set will appear on the leaderboard when the organizers finish checking the submissions.

This sample competition allows you to submit either:

  • Only prediction results (no code).
  • A pre-trained prediction model.
  • A prediction model that must be trained and tested (only this option is accepted for ranking).

The submissions are evaluated using the accuracy and the number of features metric.
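Because the score rewards using few features, a submission usually pairs a feature selection step with a regression model. The sketch below shows one simple filter approach on toy data: rank columns by absolute Pearson correlation with the target and keep the top k. It is not the organizers' baseline. The helper names, the toy data, and the choice of correlation as the ranking criterion are all assumptions. On the real X_train, a numerical library would be the practical choice.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_k_features(X, y, k):
    """Return the sorted column indices of the k features most correlated with y."""
    n_features = len(X[0])
    ranked = []
    for j in range(n_features):
        column = [row[j] for row in X]
        ranked.append((abs(pearson(column, y)), j))
    ranked.sort(reverse=True)
    return sorted(j for _, j in ranked[:k])


# Toy data (hypothetical): 200 samples, 6 features, target depends on columns 1 and 4.
rng = random.Random(0)
X_toy = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y_toy = [3.0 * row[1] - 2.0 * row[4] + rng.gauss(0, 0.1) for row in X_toy]

selected = top_k_features(X_toy, y_toy, k=2)
```

A univariate filter like this is cheap but ignores interactions between features, so it is only a starting point for a problem of this size.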

Rules

Submissions must be made before the end of phase 1. You may submit 15 submissions every day and 100 in total.

This challenge is for educational purposes only and no prizes are granted. It is governed by the general ChaLearn contest rules.

Development Phase

Start: Feb. 1, 2021, midnight

Description: Development phase: tune your models and submit prediction results, trained model, or untrained model.

Final Phase

Start: April 30, 2022, midnight

Description: Final phase (no submission, your last submission from the previous phase is automatically forwarded).

Competition Ends

Never

# Username Score
1 mdsalem17 26.4017
2 abeona 15.4747