MIND News Recommendation Competition Forum


> Data Leakage problem

In this competition, we can actually use "future" data to predict past results (e.g., using data from 10 a.m. to predict a 7 a.m. sample). This is a typical data leakage problem. Is this allowed?

Posted by: YangZhenghong @ Aug. 25, 2020, 6:09 a.m.

Hi YangZhenghong, as described in "MIND: A Large-scale Dataset for News Recommendation", the news click history for samples in the training set was constructed from click behaviors in the first four weeks, and for samples in the test set the click history was extracted from the first five weeks. Thus, there should be no data leakage problem. If you find any real cases of data leakage in the data, you are welcome to email mind@microsoft.com.

Posted by: MIND_Organizer @ Aug. 26, 2020, 5:34 a.m.

Hello MIND organizer, thanks for your detailed reply about the dataset. I have verified that there is no data leakage in the training set by intersecting each sample's clicked-news history with its impressions. In the test set, however, about 5 percent of the samples carry a potential data leakage risk, because the same news ID appears in both the clicked-news history and the impressions of these samples. I cannot verify whether these samples are actually leaked.
The user IDs of some related samples are:

Posted by: YangZhenghong @ Aug. 27, 2020, 4:45 a.m.

Hi YangZhenghong, samples in which the same news ID appears in both the click history and the impression log are not leaked. A user may click the same news article more than once (e.g., N45124 for U626729).

Posted by: MIND_Organizer @ Aug. 27, 2020, 7:22 a.m.
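
For readers who want to reproduce the check discussed in this thread, the intersection of click history and impressions can be sketched roughly as below. This is only a minimal sketch, not the method either poster actually used. It assumes the tab-separated behaviors file layout of impression ID, user ID, time, history, and impressions, with impression items optionally carrying a "-0"/"-1" label suffix. The function name and the sample rows are hypothetical and made up for illustration.

```python
import csv
import io


def find_history_impression_overlap(tsv_file):
    """Return (user_id, overlapping_news_ids) for every row whose impression
    list shares at least one news ID with the click history.

    Assumed column order: impression ID, user ID, time, history, impressions.
    """
    results = []
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:
            continue  # skip malformed or empty lines
        _impression_id, user_id, _time, history, impressions = row[:5]
        history_ids = set(history.split())
        # Drop an optional click label such as "-1"; unlabeled IDs pass through.
        impression_ids = {item.rsplit("-", 1)[0] for item in impressions.split()}
        overlap = sorted(history_ids & impression_ids)
        if overlap:
            results.append((user_id, overlap))
    return results


# Hypothetical rows for illustration only (not taken from the real dataset).
SAMPLE = (
    "1\tU1\t11/15/2019 7:00:00 AM\tN1 N2 N3\tN4-0 N2-1\n"
    "2\tU2\t11/15/2019 8:00:00 AM\tN5 N6\tN7 N8\n"
    "3\tU3\t11/15/2019 9:00:00 AM\t\tN1-0\n"
)

if __name__ == "__main__":
    for user_id, news_ids in find_history_impression_overlap(io.StringIO(SAMPLE)):
        print(user_id, news_ids)
```

As the organizer's reply points out, an overlap found this way does not by itself indicate leakage, since a user can click the same article again after it already appears in the history.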