Hi, I have a question about data preprocessing in the provided code. If I've got it right, it seems that for categorical features, a new ordinal encoder is generated to encode the feature for each new batch of data. For example, we may use encoder E1 for train1.data and another encoder E2 for test1.data. So how can we make sure that every possible categorical feature value will be encoded in the same way by different encoders? Thanks a lot~

Posted by: harryfootball @ Aug. 13, 2018, 6:37 a.m.
Thank you for pointing this out. Certainly, there is no guarantee of exact correspondence of encoded values for categorical features across batches. We proceeded this way because, given the size of some data sets, it was not feasible to load the whole data or to use other encodings that map to the same values (e.g., hashing encodings). Please note, however, that the nature of the data (which will not be revealed) makes this encoding informative: within batches, the magnitude of the value is associated with the time at which each attribute-value appeared. In fact, in preliminary experimentation, better results were obtained using this encoding vs. loading the categorical variables as integers (and it was also competitive with other encodings such as hashing).
Best

Posted by: hugo.jair @ Aug. 13, 2018, 8:48 p.m.
Hi, thanks a lot for your reply. Sorry, but I still don't quite understand the part where you said "within batches, the magnitude of the value is associated with the time each attribute-value appeared". Does it mean that in the source data file, a larger categorical feature value corresponds to a higher probability of this category? I have tried to plot some figures to validate this idea, but it didn't work out as expected. So I'm a little confused here. Would you please help explain the idea? Thank you~
Another question: is it possible for us to make changes to the data preprocessing part of the provided code? It seems that at present we can only operate on the given feature matrix X in model.py. Thanks a lot~
I found that the code spends a lot of time reading the several datasets into DataManager. The time left for participants' own processing is really short, and since I don't know how long each dataset takes to read, I can only estimate the remaining time in order to guarantee that I have enough time to predict the following batches.
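Since the loading time is not known in advance, one workaround is to track the elapsed time yourself and reserve a margin for prediction. A minimal sketch, assuming an overall per-dataset time limit of 3600 seconds (an illustrative number, not the challenge's actual limit):

```python
import time

# Hypothetical time-budget tracker: measure elapsed wall-clock time
# and budget the remainder for training and prediction.
class TimeBudget:
    def __init__(self, total_seconds):
        self.start = time.time()
        self.total = total_seconds

    def remaining(self):
        """Seconds left before the assumed limit is reached."""
        return self.total - (time.time() - self.start)

budget = TimeBudget(total_seconds=3600)  # assumed overall limit
# ... load batch, train model ...
if budget.remaining() < 60:  # reserve time to emit predictions
    pass  # fall back to a cheaper model or reuse earlier predictions
```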
Categorical variables are encoded with ordinal numbers. These numbers are assigned to the values of a variable in the order in which they appear chronologically in the data. Hence the magnitude of the code carries information about when each categorical value appeared: small values indicate the value appeared at the beginning of the batch, large values indicate it was first observed near the end of the batch. This does not solve the across-batch correspondence. But at least you can be certain that, in every batch, small codes correspond to values that appeared at the beginning and large codes to values appearing near the end. I am sorry that we cannot say much more about the origin of the data.
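The encoding described above can be illustrated with a short sketch (the function name and sample data are my own, not the organizers' code): each value receives the next integer the first time it is seen, so codes reflect order of first appearance within a batch and need not match across batches.

```python
# Hypothetical sketch of appearance-order ordinal encoding.
def ordinal_encode_by_appearance(column):
    """Map each categorical value to the order of its first appearance."""
    mapping = {}
    encoded = []
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping)  # first appearance gets next code
        encoded.append(mapping[value])
    return encoded, mapping

# Within a batch: small codes = seen early, large codes = seen late.
batch1 = ["cat", "dog", "cat", "bird"]
batch2 = ["dog", "bird", "cat"]
enc1, _ = ordinal_encode_by_appearance(batch1)  # [0, 1, 0, 2]
enc2, _ = ordinal_encode_by_appearance(batch2)  # [0, 1, 2] -- "dog" is now 0
```

Note how "dog" is encoded as 1 in the first batch but 0 in the second, which is exactly the across-batch mismatch discussed in this thread.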
Regarding the other question: this is not possible; you can only interact with the data via your model. You could, however, remove columns or extract new features starting from the data matrix.
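As a sketch of what "removing columns or extracting new features from the matrix" could look like inside your model code (assuming X is a NumPy matrix as in model.py; the function and the derived feature are illustrative, not part of the provided code):

```python
import numpy as np

def preprocess(X, drop_cols=()):
    """Drop the given column indices and append one derived feature."""
    keep = [j for j in range(X.shape[1]) if j not in set(drop_cols)]
    X_kept = X[:, keep]
    # Illustrative derived feature: row-wise mean appended as a new column.
    row_mean = X_kept.mean(axis=1, keepdims=True)
    return np.hstack([X_kept, row_mean])

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Xp = preprocess(X, drop_cols=(1,))  # drops column 1, appends the row mean
```

This kind of transformation stays entirely inside the model's view of X, which is consistent with the restriction described above.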
Best

Posted by: hugo.jair @ Aug. 14, 2018, 2:55 p.m.
You are right, a lot of time is spent loading the data. This is a particularity of this challenge, and that is why we are giving more time than in previous challenges.

Posted by: hugo.jair @ Aug. 14, 2018, 2:57 p.m.