Criteo Privacy Preserving ML Competition @ AdKDD Forum


> Congrats in advance to the winners (throwing in the towel...)

I would like to congratulate the eventual winners for setting such a high bar (or low log loss :)).

I'm looking forward to hearing the secrets of the leaders and learning how it's done.

I tried a number of things and spent an increasing amount of time on this throughout July, but just could not budge my score.
Still, I enjoyed the process.

In the end, I probably spent too much time on supervised learning with the 100k dataset instead of working directly from
the noisy aggregated data sets.

What probably worked best for me was a form of what I think is called Target Encoding: I joined the 100k dataset
with the two aggregated data sets, took the CTR of each feature or feature cross from them, and replaced
the hash_ columns with all the single and pair CTRs.
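
For anyone who wants to picture that join, here is a rough pandas sketch. It is only an illustration, not the code used for the competition: the column names (`hash_0`, `hash_1`, `nb_display`, `nb_click`), the helper `add_ctr`, and the toy numbers are all assumptions and may not match the real schema.

```python
import pandas as pd

# Toy stand-in for the small labelled dataset (hypothetical column names).
small = pd.DataFrame({
    "hash_0": [1, 2, 1, 3],
    "hash_1": [7, 7, 8, 8],
    "click":  [0, 1, 0, 0],
})

# Toy stand-ins for the noisy aggregated tables (hypothetical schema):
# one row per feature value (single) or per pair of values (pair).
agg_single = pd.DataFrame({
    "hash_0": [1, 2, 3],
    "nb_display": [100.0, 50.0, 10.0],
    "nb_click": [10.0, 20.0, 1.0],
})
agg_pair = pd.DataFrame({
    "hash_0": [1, 2, 1],
    "hash_1": [7, 7, 8],
    "nb_display": [40.0, 50.0, 60.0],
    "nb_click": [2.0, 20.0, 6.0],
})

def add_ctr(df, agg, keys, name):
    """Left-join the aggregated CTR for `keys` onto df as column `name`."""
    ctr = agg[keys].copy()
    ctr[name] = agg["nb_click"] / agg["nb_display"]
    return df.merge(ctr, on=keys, how="left")

encoded = add_ctr(small, agg_single, ["hash_0"], "ctr_hash_0")
encoded = add_ctr(encoded, agg_pair, ["hash_0", "hash_1"], "ctr_hash_0_x_hash_1")

# Replace the raw hashed columns with their CTR encodings.
features = encoded.drop(columns=["hash_0", "hash_1"])
print(features)
```

Pairs that never appear in the aggregated table (the last row here) come out as NaN after the left join, so they need a fallback such as the global CTR before training.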

Using smoothed CTRs didn't help that much, nor did some forms of filtering (eliminating aggregated rows with counts < 0),
or adding something to the negative counts to make them less insane.
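
To make those variants concrete, here is a small sketch of one common way to do each of them: additive smoothing toward a prior CTR, clipping negative counts to zero, and dropping rows with negative counts. The exact formulas used in the competition are not given above, so the function `smoothed_ctr`, its `strength` parameter, the prior of 0.1, and the column names are all assumptions.

```python
import pandas as pd

# Toy noisy aggregated table (hypothetical schema); added noise can push
# counts below zero.
agg = pd.DataFrame({
    "hash_0": [1, 2, 3, 4],
    "nb_display": [100.0, 4.0, -3.0, 20.0],
    "nb_click": [10.0, -1.0, 0.5, 2.0],
})

def smoothed_ctr(clicks, displays, prior_ctr, strength=10.0):
    """Additive smoothing: shrink each CTR toward prior_ctr.

    Negative counts are clipped to zero first, which is one simple
    way to tame them (an assumption, not the poster's exact fix).
    """
    clicks = clicks.clip(lower=0.0)
    displays = displays.clip(lower=0.0)
    return (clicks + strength * prior_ctr) / (displays + strength)

prior = 0.1  # assumed global CTR, for illustration only
agg["ctr_smooth"] = smoothed_ctr(agg["nb_click"], agg["nb_display"], prior)

# Alternative: drop aggregated rows that have any negative count.
filtered = agg[(agg["nb_display"] >= 0) & (agg["nb_click"] >= 0)]
print(agg)
print(filtered)
```

With a larger `strength`, rows with few displays are pulled harder toward the prior, while rows with many displays barely move.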

A few days ago I put together a different approach: generating training data from the aggregated pair data, doing supervised
training off of that, and then using the 100k data set as a test set. But the metrics I got by training on the generated data were not
reproduced on the 100k set. [Note: "generate" here does not mean a GAN approach, just simple statistics.]

My best guess is that better approaches might have used some kind of Bayesian Network approach on the aggregated data, but I
just could not wrap my head around how to calculate the probabilities. Maybe there's some matrix technique too (I don't know, a big covariance
matrix), but again, I could not wrap my head around the math.

Thanks to the organizers too.

--

Rough and Random notes:

I used Colab, then Colab Pro, then Google Compute, then finally my M1 MacBook Air. It was nice seeing the smoother experience of Colab Pro, and
this was a good forcing function to figure out how to get TensorFlow installed on the M1.

From some of my lit review, it was interesting seeing the papers focusing on crosses (e.g. Deep Cross Network et al.)
and on feature discretization / bucketing / binning (esp. the paper that mentions AutoDis).
It was also helpful to review techniques for unbalanced (or is it "imbalanced"?) data sets, though I never
got to the GAN techniques for generating minority class data.

Used LightGBM a bit but probably didn't spend enough time with these decision forest techniques.
Never got to Stacking or Distillation.

Once I had moved off Colab, I coded up a command-line tool similar to Uber's Ludwig, and this was a pleasing way to work.

This contest was a good way to get more experience with Pandas, esp groupby & aggregations.

thx again
- Dave

Posted by: knyght @ July 31, 2021, 7:51 p.m.

Thank you @knyght for these insights!

Very interesting story. We are looking forward to more participants sharing their experience and thoughts :)

Also, please consider sending pull requests to the Related Works doc of the competition if you have links to work you found useful for solving the task:
https://github.com/criteo-research/criteo-privacy-preserving-ml-competition/blob/master/RELATEDWORKS.md

Posted by: eustache @ Aug. 3, 2021, 9:44 a.m.