I noticed that each dataset has an info file describing the type of each feature. Since data preprocessing is done by data_io.py, it seems that I don't have access to this information in model.py. Model.fit receives only X and y, both of type np.array. So how can I get the feature types (numerical or categorical)? Thanks a lot.
Posted by: iamrobot @ Aug. 24, 2018, 2:21 a.m.
Thanks for your question. You just have to check the data manager file to see the order in which features are appended to the X matrix.
HJ
Posted by: hugo.jair @ Aug. 24, 2018, 7:16 a.m.
Thanks for the reply. I see that the DataManager can read the feature info. However, the Model class has no access to it (there is no such argument in Model.__init__). As far as I can tell, the only code I can modify and upload is model.py; everything else is the framework, which is immutable.
After a careful look through the rules, I found that the feature columns of every dataset have been published. Does that mean I should hardcode the feature format and assume the datasets are fed in A, B, C, D, E order?
Posted by: iamrobot @ Aug. 24, 2018, 8:56 a.m.
You are right, the feature types are disclosed in the public.info files; there you can see how many features of each type are present in the dataset. Knowing this, together with the order in which the features are loaded to build matrix X, is enough to tell which features correspond to which types. I will check the possibility of passing this info to the model, since in the final phase your code will run autonomously and the datasets are different. This may take a while, though.
Posted by: hugo.jair @ Aug. 24, 2018, 11:06 p.m.
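The mapping described in this reply (per-type counts plus load order gives per-column types) can be sketched roughly as follows. This is only an illustration: the type names, the counts, and the block order are hypothetical placeholders, and the real order has to be read from the data manager code, as suggested earlier in the thread.

```python
from typing import Dict, List, Sequence


def column_types(counts: Dict[str, int], order: Sequence[str]) -> List[str]:
    """Expand per-type feature counts into one type label per column of X."""
    labels: List[str] = []
    for feat_type in order:
        # Each type contributes a contiguous block of columns.
        labels.extend([feat_type] * counts.get(feat_type, 0))
    return labels


# Hypothetical counts, as they might be copied by hand from a public.info file.
counts = {"numerical": 3, "categorical": 2}
# Assumed block order; verify it against the data manager before relying on it.
order = ("numerical", "categorical")

types = column_types(counts, order)
categorical_idx = [i for i, t in enumerate(types) if t == "categorical"]
```

Since model.py cannot read the info files at run time, the counts would have to be hardcoded per dataset, which only works while the datasets are known in advance.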
If the methods in model.py can't access self.info in DataManager, then there is no need to parse *.info in DataManager. Is that right?
Posted by: lsmsilence @ Aug. 28, 2018, 1:23 p.m.
I think the DataManager still needs the *.info file to load the raw CSV into a numpy array. It also encodes the categorical features.
What confuses me is the tips for this competition, which say:
Some basic tips for handling difficult features:
• One-hot encoding for Categorical features.
• Hashing tricks might be used for Categorical and Multi-value Categorical features.
The encoding has already been done by the DataManager, and model.py doesn't even have access to the raw data. The tips seem to be useless.
Posted by: iamrobot @ Aug. 29, 2018, 6:52 a.m.
Yes, you are right, but in model.py I still don't know which columns are one-hot features and which are original numerical features. In the model file, I get a numpy array with dtype=float. A value of 100 could be an ID with no statistical significance, or it could be an original number, and I can't distinguish them.
Posted by: lsmsilence @ Aug. 29, 2018, 8:25 a.m.
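Until the type information is passed to the model, one possible workaround is to guess from the values themselves. The sketch below is a heuristic only, not something the organizers proposed: the function name and the `max_unique_ratio` threshold are hypothetical, and a genuine numerical feature that happens to take a few whole-number values would be flagged as well.

```python
import math
from typing import List, Sequence


def guess_id_like_columns(
    rows: Sequence[Sequence[float]], max_unique_ratio: float = 0.05
) -> List[int]:
    """Return indices of columns that look like encoded IDs or categories.

    A column is flagged when all of its non-missing values are whole numbers
    and it has few distinct values relative to its length (assumed threshold).
    """
    flagged: List[int] = []
    for j, col in enumerate(zip(*rows)):
        values = [float(v) for v in col if not math.isnan(v)]
        if not values:
            continue  # nothing to judge in an all-missing column
        all_whole = all(v.is_integer() for v in values)
        few_distinct = len(set(values)) <= max(2, max_unique_ratio * len(values))
        if all_whole and few_distinct:
            flagged.append(j)
    return flagged


# Tiny made-up example: column 0 is continuous, column 1 holds whole-number codes.
rows = [[0.5, 100.0], [1.25, 101.0], [2.0, 100.0], [3.75, 101.0]]
id_like = guess_id_like_columns(rows)
```

The function iterates over rows, so it should also accept the 2-D float array that Model.fit receives, but that has not been checked against the competition framework.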
Yes, that's exactly the problem I posted. I think the organizers are working on this, and it may take a while.
Posted by: iamrobot @ Aug. 29, 2018, 1:13 p.m.
So model.py has access neither to datamanager.feat_types_up nor to the public.info files. I think the problem discussed here is fundamental, and a faster solution from the organisers would be appreciated.
Posted by: ostapeno @ Sept. 5, 2018, 12:37 p.m.