On behalf of team MLHuskies, I want to thank the organizers for such a fun and interesting competition! In trying to improve our solutions, we learned many techniques we didn't know we didn't know. Learning the techniques and tricks behind the winning solutions will no doubt be another step in that journey ;-) Among other things, we are curious to see if someone found a good way to improve the loss by enforcing consistency constraints on the noisy data.
We started out by training a Naive Bayes classifier directly on the aggregated singles data. The results were not great, likely in part because the probabilities were not well calibrated. We thought about mitigating that, and also about training tree-augmented Naive Bayes classifiers to pull in information from the aggregated pairs data. Those remained ideas, because at that point we came across neural-network-based solutions (Deep Crossing from Microsoft, DLRM from Facebook, etc.) and believed that was the way to go. In the end, our best solutions combined LightGBM models with (hierarchical) target encoding; a rough sketch of that combination follows below. We never made any attempt to generate synthetic data based on the aggregated statistics.
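For anyone curious what that combination can look like, here is a minimal sketch. It is not our actual competition pipeline: the column names, the smoothing weight, and the LightGBM parameters are illustrative placeholders. The idea is a smoothed target encoding (per-category target means shrunk toward the global mean, so rare or unseen categories fall back to the global prior) feeding a LightGBM binary classifier.

```python
# A minimal sketch, NOT our actual pipeline: column names, smoothing
# weight, and LightGBM parameters here are placeholders.
import numpy as np
import pandas as pd
import lightgbm as lgb

def target_encode(train, test, col, target, prior_weight=20.0):
    """Smoothed target encoding: per-category target means shrunk toward
    the global mean, so rare categories lean on the global prior."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"]
                + prior_weight * global_mean) / (stats["count"] + prior_weight)
    # Categories unseen in training fall back to the global mean.
    return train[col].map(smoothed), test[col].map(smoothed).fillna(global_mean)

# Toy data standing in for the real features and label.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "cat_a": rng.integers(0, 50, n).astype(str),
    "cat_b": rng.integers(0, 10, n).astype(str),
    "y": (rng.random(n) < 0.3).astype(int),
})
train, test = df.iloc[:800].copy(), df.iloc[800:].copy()

for col in ("cat_a", "cat_b"):
    train[col + "_te"], test[col + "_te"] = target_encode(train, test, col, "y")

features = ["cat_a_te", "cat_b_te"]
booster = lgb.train(
    {"objective": "binary", "metric": "binary_logloss", "verbosity": -1},
    lgb.Dataset(train[features], label=train["y"]),
    num_boost_round=100,
)
preds = booster.predict(test[features])  # predicted probabilities
```

In practice you would compute the training-set encoding out-of-fold to avoid target leakage, and the "hierarchical" part means falling back through coarser groupings (e.g. category, then parent category, then global mean) rather than jumping straight to the global prior as this sketch does.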
Thanks again to the organizers for the fun ride, and our congratulations to the winners. We are really looking forward to hearing from you how it's done!!
Martine
Posted by: mdecock @ Aug. 3, 2021, 8:05 p.m.