"Explosive growth of cities globally signifies the demographic transition from rural to urban, and is associated with shifts from an agriculture-based economy to mass industry, technology, and service. In principle, cities offer a more favorable setting for the resolution of social and environmental problems than rural areas. Cities generate jobs and income, and deliver education, health care and other services. Cities also present opportunities for social mobilization and women's empowerment." - the World Bank.
Given the importance of urban growth, this project uses crowdsourcing to develop a data-based model of future urban growth as a function of socio-economic indicators. The goal is to find a minimal set of indicators that explains, to a reasonable degree, the population growth in urban areas. To this end, we designed a challenge on how best to model urban growth using the smallest number of features while keeping good accuracy. The winning model can help governments design better policies to increase well-being and to regulate urban growth, avoiding the problems brought by either extreme: stagnation and overgrowth. The model can equally serve businesses. Real estate agencies, for example, can use its results to better plan their resources for the upcoming year.
The novelty of this challenge lies in using crowdsourcing for feature selection (an NP-hard problem) on geographical, social and economic indicators for urban growth modelling. Human insight is very important here because of the shape of the available data. The dataset has many more features than examples, and the number of possible feature subsets is far greater than the largest estimate of the total number of atoms in the universe.
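To give a sense of that scale, the subset count can be checked with a few lines of arithmetic. The sketch below assumes the 29884 features reported for X_train further down this page, and takes 10^82 as a commonly quoted upper estimate for the number of atoms in the observable universe; that figure is an assumption, not part of the challenge material.

```python
import math

# Number of columns in X_train, as stated on this page.
N_FEATURES = 29884
# Assumed upper estimate for the number of atoms in the observable
# universe, expressed as a power of ten (not from the challenge page).
ATOMS_EXPONENT = 82

# Each feature is either kept or dropped, so there are 2**N_FEATURES subsets.
# Work with base-10 exponents to compare orders of magnitude.
subsets_exponent = N_FEATURES * math.log10(2)

print(f"2**{N_FEATURES} is roughly 10**{subsets_exponent:.0f}")
print("more subsets than atoms:", 2 ** N_FEATURES > 10 ** ATOMS_EXPONENT)
```

Exhaustive search over subsets is therefore out of reach, which is why the challenge relies on heuristics and human insight.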
Your challenge is to predict the urban growth of a country for a given year from a large amount of data about that country in the previous year.
Mykola LIASHUHA, Guilherme SALES SANTA CRUZ, Louis LAMALLE, Mohamed Salem MESSOUD, Ousmane Cissé, Romain JAMINET
mykola.liashuha@gmail.com
The data comes from World Bank Data and contains the main socio-economic indicators of each country. Link to the World Bank website: here.
This research was [partially] supported by Labex DigiCosme (project ANR11LABEX0045DIGICOSME) operated by ANR as part of the program Investissement d'Avenir Idex ParisSaclay (ANR11IDEX000302)
The criteria for the task are, in order, the smallest number of features used by the model and the highest accuracy. The minimum accuracy measured by feature_selection_metric is 33.
The metric used in the scoring program and the iPython notebook is our own metric, called "feature_selection_metric". It combines two criteria: predictive performance and the number of features used. The two are weighted so that a 10% improvement in performance is worth the same as removing two features from the input space.
To get the best score, use as few features as possible while achieving the best performance.
[Figure: graph illustrating the trade-off between performance and number of features.]
$\log_{0.99}(\text{performance}) - \#\text{features} \times 7$, where performance is measured with the $R^2$ metric, which describes the accuracy of the regression predictions.
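For readers unfamiliar with R2, the following is a minimal sketch of the standard coefficient of determination, 1 - SS_res / SS_tot. The function name `r2` and the toy values are illustrative only; the official scoring program may compute the metric through a library routine instead.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration.
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.1, 1.9, 3.2, 3.8]
print(r2(y_true_toy, y_pred_toy))
```

A perfect prediction gives 1, and always predicting the mean of the targets gives 0.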
In addition, some garbage features (probes) were inserted into the dataset to assess your selection method. The final score is the original score multiplied by the fraction of good (non-probe) features among the features you selected:
$p = (\text{used\_features} - \text{used\_probes\_features}) / \text{used\_features}$
So the final evaluation formula is:
$\text{Score} = p \times \left(\log_{0.99}(1 - R^2) - \#\text{features} \times 7\right)$
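As a rough illustration, the final formula above can be written as a small function. This is a reading of the formula as printed on this page, not the official scoring program; the function name, argument names and default values are illustrative assumptions.

```python
import math


def feature_selection_score(r2, n_features, n_probes,
                            feature_cost=7.0, log_base=0.99):
    """Score = p * (log_base(1 - R2) - feature_cost * n_features).

    p is the fraction of selected features that are not probes.
    """
    if n_features <= 0:
        raise ValueError("at least one feature must be selected")
    if not 0 <= n_probes <= n_features:
        raise ValueError("n_probes must lie between 0 and n_features")
    if r2 >= 1.0:
        raise ValueError("1 - R2 must be positive for the logarithm")
    # Fraction of genuine (non-probe) features among the selected ones.
    p = (n_features - n_probes) / n_features
    # Logarithm in base 0.99 of the unexplained variance, minus feature cost.
    base_score = math.log(1.0 - r2, log_base) - feature_cost * n_features
    return p * base_score


# Hypothetical submission: R2 = 0.5 with 5 features, 1 of them a probe.
print(feature_selection_score(r2=0.5, n_features=5, n_probes=1))
```

Selecting only probes drives p, and therefore the score, to zero, whatever the accuracy.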
For training, you are given a data matrix X_train of dimension 15290 x 29884 and an array y_train of labels of dimension num_training_samples. You must train a model that predicts the labels for two test matrices, X_valid and X_test.
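One simple starting point, shown here only as a hypothetical baseline and not as the organizers' method, is to rank features by the absolute Pearson correlation with the labels and keep the top k. The sketch uses a tiny made-up matrix (`X_toy`, `y_toy`) rather than the real X_train, and all function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_k_features(X, y, k):
    """Return the indices of the k columns most correlated with y."""
    n_features = len(X[0])
    ranked = []
    for j in range(n_features):
        column = [row[j] for row in X]
        ranked.append((abs(pearson(column, y)), j))
    # Strongest correlation first; ties broken by column index.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in ranked[:k]]


# Toy data: column 0 tracks y, column 1 is constant, column 2 is noise.
X_toy = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.7],
]
y_toy = [2.1, 3.9, 6.2, 7.8]
selected = top_k_features(X_toy, y_toy, k=1)
print(selected)
```

A univariate filter like this ignores interactions between features, so it is best treated as a first pass before trying more careful selection strategies.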
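There are 2 phases: a development phase, in which you tune your models and make submissions, and a final phase, in which no new submission is accepted and your last submission from the development phase is automatically forwarded.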
This sample competition allows you to submit either prediction results, a trained model, or an untrained model.
Submissions are evaluated on both accuracy and the number of features used.
Submissions must be made before the end of phase 1. You may make up to 15 submissions per day and 100 in total.
This challenge is for educational purposes only and no prizes are granted. It is governed by the general ChaLearn contest rules.
Start: Feb. 1, 2021, midnight
Description: Development phase: tune your models and submit prediction results, trained model, or untrained model.
Start: April 30, 2022, midnight
Description: Final phase (no submission, your last submission from the previous phase is automatically forwarded).
# | Username | Score |
---|---|---|
1 | mdsalem17 | 26.4017 |
2 | abeona | 15.4747 |