Aggregated clicks and sales contain negative values. That's the result of applying noise, correct? Since counts cannot actually be negative, can we expect that lower noisy values correspond, on average, to true counts closer to 0?

What about the feature values themselves? Should I assume that one hashed value being bigger than another implies nothing about the order of the original, unhashed values?

Re: noisy counts, your interpretation is right. There is always a small probability that the original count (before noise injection) was high, but on average a negative value is indeed a sign that the true count was close to 0.
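A minimal sketch of this effect, assuming additive Laplace noise (the actual noise mechanism and scale used for the aggregated files may differ; the values below are arbitrary illustrations): small true counts frequently come out negative after noise, while large ones rarely do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true counts; the small ones are the most likely
# to turn negative once zero-mean Laplace noise is added.
true_counts = np.array([0, 1, 2, 100, 5000])
noisy = true_counts + rng.laplace(loc=0.0, scale=10.0, size=true_counts.size)

for t, n in zip(true_counts, noisy):
    print(f"true={t:>5d}  noisy={n:8.1f}")

# A simple post-processing estimate clips negatives back to zero.
estimate = np.clip(noisy, 0, None)
```

Clipping at zero is only one option; keeping the raw noisy values is statistically unbiased, so the right choice depends on how the counts are used downstream.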

Re: feature hashing, there is no guarantee that it preserves monotonicity. The safest interpretation is to treat each hashed value as a distinct categorical token rather than an ordered quantity.
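To illustrate why hashed values carry no order, here is a generic hashing sketch (the `feature_hash` helper and bucket count are made up for illustration; the competition's actual hashing scheme is not specified here). Two inputs that are "close" or ordered before hashing land in unrelated buckets afterwards.

```python
import hashlib

def feature_hash(value: str, n_buckets: int = 16) -> int:
    """Illustrative stable hash of a raw feature value into a small bucket id."""
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % n_buckets

# The bucket ids below have no relationship to the natural order
# small < medium < large of the original values.
for v in ["small", "medium", "large"]:
    print(v, feature_hash(v))
```

In other words, comparing two hashed values with `<` or `>` is meaningless; they should be consumed as categorical ids (e.g. via one-hot encoding or embeddings).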

Posted by: eustache @ July 5, 2021, 7:09 a.m.

I do not understand the purpose of the two aggregate files: why can't we simply compute the counts from "X_train.csv.gz"?

Posted by: tetelias @ July 5, 2021, 4:41 p.m.

As stated in the docs, X_train is small (100k samples), while the aggregated_noisy_data_singles/pairs files are computed on a much larger sample (100M). Hence you will observe more modalities of P(X) in the aggregated data than in the small, granular dataset.
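A quick sanity check of the sampling argument (the Zipf distribution here is an arbitrary stand-in for a heavy-tailed categorical feature, not the challenge's actual data): a small sample of a skewed feature misses many of the rare modalities that a much larger sample captures.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical heavy-tailed categorical feature: many rare modalities.
population = rng.zipf(1.5, size=100_000)
small_sample = population[:1_000]

print("modalities in large sample:", len(np.unique(population)))
print("modalities in small sample:", len(np.unique(small_sample)))
```

This is why the aggregated files add information beyond X_train: they expose counts for modalities (and pairs of modalities) that a 100k-row sample rarely or never contains.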

Posted by: eustache @ July 6, 2021, 7:50 a.m.