Hi,
This thread is dedicated to discussing/proposing ideas on how to solve the challenge.
This way new challengers can pick up some ideas and veterans can discuss the merits of different ideas :)
Posted by: eustache @ March 26, 2021, 10:32 a.m.

Idea #1: feature quantization/bucketization
A simple idea is to pick a few predictive dimensions of the data and bucketize them - this is what the starting script "my_first_submission.py" does, with only one feature.
One could argue that it would help prediction, as the following links suggest:
- https://developers.google.com/machine-learning/data-prep/transform/bucketing (on housing dataset)
- https://arxiv.org/abs/2004.02088 (it even seems to improve GAN training)
It could also improve privacy, since more individuals would share the same anonymized data representation (all records that fall into the same buckets/bins become indistinguishable from one another).
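As a concrete illustration, here is a minimal Python sketch of equal-width bucketization with numpy; the column indices and bin count are made-up examples, not parameters from the challenge:

    import numpy as np

    def bucketize(X, cols=(0, 2), n_bins=8):
        # Quantize the selected columns of X into equal-width bins.
        # cols and n_bins are illustrative choices, to be tuned.
        X_anon = X.astype(float).copy()
        for c in cols:
            # Equal-width bin edges over the observed range of the column.
            edges = np.linspace(X[:, c].min(), X[:, c].max(), n_bins + 1)
            # Index of the bin each value falls into (0 .. n_bins-1).
            idx = np.clip(np.digitize(X[:, c], edges[1:-1]), 0, n_bins - 1)
            # Replace each raw value by the center of its bin, so all
            # records in a bin share the same representation.
            centers = (edges[:-1] + edges[1:]) / 2
            X_anon[:, c] = centers[idx]
        return X_anon

The coarser the bins (smaller n_bins), the more records collapse onto the same representation - better for privacy, but at some point worse for utility.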
Posted by: eustache @ March 26, 2021, 10:39 a.m.

Idea #2: Noise Injection
Injecting noise is probably one of the simplest ways to reduce the privacy loss (i.e. improve privacy). Indeed, with a bit of noise it becomes more difficult for the privacy attacker to match different records in the dataset to the same user.
A popular/trendy way to do so is to inject random noise drawn e.g. from a Gaussian or Laplace distribution, as recommended by Differential Privacy theory. Here is an introduction to this concept:
https://medium.com/georgian-impact-blog/a-brief-introduction-to-differential-privacy-eacf8722283b
One could for instance add noise to all or some features in the data representation, or even add random variables to fool the privacy attacker's learning algorithm!
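For illustration, here is a minimal sketch of the Laplace mechanism in Python; the epsilon and sensitivity values are placeholders and would need to be calibrated to the actual feature ranges for any formal differential privacy guarantee:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def add_laplace_noise(X, epsilon=1.0, sensitivity=1.0):
        # Perturb every feature with zero-mean Laplace noise of scale
        # sensitivity/epsilon (placeholder values, not calibrated).
        scale = sensitivity / epsilon
        return X + rng.laplace(loc=0.0, scale=scale, size=X.shape)

Smaller epsilon means more noise: harder for the attacker, but also harder for the downstream prediction task.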
If you play with the idea don't hesitate to post your thoughts in the forum :)
Posted by: eustache @ March 30, 2021, 12:13 p.m.

Idea #3: use embeddings as data representation
The idea is to use intermediate (hidden) layers of a neural network as the data representation.
For instance, one may train a Multi-Layer Perceptron with a cross-entropy loss and use the last hidden layer as the representation.
A popular instance of this technique for Natural Language Processing is Word2vec [1], but it can be used with whatever input data you like. Modern neural net toolboxes such as Keras or PyTorch make it easy to experiment with this technique [2].
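As a rough PyTorch sketch (the layer sizes and the classification head are made-up choices for illustration, not the challenge's setup):

    import torch
    import torch.nn as nn

    class MLP(nn.Module):
        def __init__(self, n_features, n_classes, hidden_dim=32, embed_dim=16):
            super().__init__()
            # Everything up to the last hidden layer acts as the encoder.
            self.encoder = nn.Sequential(
                nn.Linear(n_features, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, embed_dim),
                nn.ReLU(),
            )
            self.head = nn.Linear(embed_dim, n_classes)

        def forward(self, x):
            return self.head(self.encoder(x))

    # After training the full model with nn.CrossEntropyLoss(), keep only
    # the encoder output as the anonymized data representation:
    # with torch.no_grad():
    #     representation = model.encoder(torch.as_tensor(X, dtype=torch.float32))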
This idea can be of course combined with other strategies like noise injection or bucketization.
Happy hacking/learning!
[1] https://en.wikipedia.org/wiki/Word2vec
[2] https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526