Today, we officially release the training and development data for SemEval Task 11: Machine Comprehension using Commonsense Knowledge. The full dataset comprises approx. 14,000 questions. The released training data contain approx. 10,000 questions on 1,500 texts; the development data contain 1,400 questions on 200 texts. All texts, questions and answers were crowdsourced via MTurk. The everyday scenarios for the narratives were selected from the DeScript, OMCS and RKP corpora, and some new scenarios were added.
Note that, unlike in the trial data, each question now has only 2 answers instead of 4. We made this change (1) to ensure that incorrect answers are less repetitive and only the most challenging ones are presented, and (2) to level out the number of answers for yes/no questions and other questions.
Because the dataset was created via crowdsourcing, a certain amount of noise cannot be ruled out. A team of trained expert annotators is currently inspecting all instances and manually filtering out faulty questions or answers. Within the next few weeks, we will release a validated version of the training dataset. This validated version will also contain normalized references to the text author, who is currently addressed in different ways ("the storyteller", "the narrator", "the author", ...). We are also planning to release a set of baseline systems together with the validated dataset.
Note also that the trial dataset was recompiled to reflect the updated number of answers per question and is now part of the training data. Submissions uploaded to the Codalab website will now be tested against the development data instead of the trial data. The test data will be released at the beginning of January.
Thanks for your participation, we're looking forward to your submissions!
Posted by: simono @ Sept. 25, 2017, 3:12 p.m.