MIND News Recommendation Competition Forum


> Confused about "Dataset Construction" in MIND paper

Hi MIND organizer,

Thank you so much for providing this wonderful dataset for students and researchers like me! I am excited to see a high-quality recommendation dataset like MIND.

I was confused by a passage in the MIND paper. Section 3.1, "Dataset Construction", says: we used the samples in the last week for test, and the samples in the fifth week for training. For samples in training set, we used the click behaviors in the first four weeks to construct the news click history. For samples in test set, the time period for news click history extraction is the first five weeks.

Could you please explain in more detail how you constructed the training and test sets?

Thank you!

Posted by: rachelinq816 @ Aug. 3, 2020, 6:50 p.m.

Hi, I have some follow-up questions about the MIND dataset:

1. About "Abstract": I checked the Microsoft News page and could not find abstracts for the news articles. Take the first example in MIND-small (url: https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata): the dataset says the abstract is "Shop the notebooks, jackets, and more that the royals can't live without." But I can't find this text on the webpage.

2. In the MIND paper, Figure 2b shows the length distribution of the abstracts. This bimodal distribution looks very strange to me. Do you have an explanation for why most abstracts have 20 or 80 tokens, while very few have 60 tokens?

3. For "History" in "behaviors.tsv": what time window was used to collect the click history? Are the news items listed in the order in which the user browsed them?

4. For "Time" in "behaviors.tsv": what is the time zone?

Thank you!

Posted by: rachelinq816 @ Aug. 4, 2020, 5:40 p.m.

The abstract is automatically generated by an internal tool.

Posted by: MIND_Organizer @ Aug. 5, 2020, 1:09 p.m.

We used the click behaviors in the first four weeks to construct the news click history.

Posted by: MIND_Organizer @ Aug. 5, 2020, 1:10 p.m.
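
For readers trying to picture the split quoted from Section 3.1, here is a minimal sketch of a week-based assignment. It is not the organizers' code. The function names, the 0-based week indexing, the placeholder start date, and the total of six weeks are all assumptions made for illustration; the thread only states "last week" for test, "fifth week" for training, and the first four or five weeks for click history.

```python
from datetime import datetime, timedelta

N_WEEKS = 6  # assumption: six weeks of logs in total


def assign_split(ts, start, n_weeks=N_WEEKS):
    """Map a timestamp to 'history', 'train', or 'test' by week index."""
    week = (ts - start).days // 7  # 0-based week index from the start date
    if week < 0 or week >= n_weeks:
        raise ValueError("timestamp outside the collection period")
    if week == n_weeks - 1:
        return "test"   # samples in the last week
    if week == n_weeks - 2:
        return "train"  # samples in the fifth week when n_weeks is 6
    return "history"    # earlier weeks only feed the click history


def history_weeks(split, n_weeks=N_WEEKS):
    """Return the 0-based weeks used to build click history for a split."""
    if split == "train":
        return list(range(n_weeks - 2))  # first four weeks
    if split == "test":
        return list(range(n_weeks - 1))  # first five weeks
    raise ValueError("split must be 'train' or 'test'")


# Hypothetical start date, used only to exercise the functions.
start = datetime(2020, 1, 1)
print(assign_split(start + timedelta(days=30), start))
print(history_weeks("test"))
```

Under these assumptions, a training sample never sees clicks from its own week in its history, and a test sample's history may include the training week.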

I'm not sure about the exact time zone, but since the dataset was collected from the logs of U.S. services, the timestamps should be in one of the U.S. time zones.

Posted by: MIND_Organizer @ Aug. 5, 2020, 1:11 p.m.
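
Since the time zone is not confirmed, one cautious way to read "behaviors.tsv" is to keep "Time" as a naive timestamp without attaching any zone. The sketch below assumes a tab-separated layout of impression ID, user ID, time, history, and impressions, and a timestamp format like "11/15/2019 10:22:32 AM"; both should be checked against your copy of the data. The sample row and the function name are made up for illustration.

```python
from datetime import datetime

# Assumed timestamp format; verify against the actual file.
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_behavior_line(line):
    """Parse one tab-separated row of behaviors.tsv into a dict."""
    imp_id, user_id, time_str, history, impressions = line.rstrip("\n").split("\t")
    return {
        "impression_id": imp_id,
        "user_id": user_id,
        # Naive datetime: no time zone is attached because it is unknown.
        "time": datetime.strptime(time_str, TIME_FORMAT),
        # The history field may be empty; split() then yields an empty list.
        "history": history.split(),
        # Each impression is assumed to look like "<news_id>-<0 or 1>".
        "impressions": [
            (item.rsplit("-", 1)[0], int(item.rsplit("-", 1)[1]))
            for item in impressions.split()
        ],
    }


# Made-up row, not taken from the dataset.
sample = "1\tU100\t11/15/2019 10:22:32 AM\tN1 N2 N3\tN4-1 N5-0"
record = parse_behavior_line(sample)
print(record["time"], record["history"], record["impressions"])
```

Keeping the timestamp naive avoids silently shifting times by a guessed offset; ordering within the file is unaffected as long as all rows share the same unknown zone.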