We have just released a validated version of the training and development data. A team of trained annotators inspected all texts and questions and checked whether the correct and incorrect answers are appropriate. Approximately 11.5% of the questions were corrected. 135 questions were excluded from the data because no appropriate correct or incorrect answer could be found.
We also automatically replaced the most common paraphrases of the narrator (storyteller, story teller, speaker, author, protagonist, person telling the story, person who told the story, person who wrote the text) with "narrator".
The text and question IDs did not change, but we re-shuffled the answers. So please download the new, validated data sets and re-run your models!
Posted by: simono @ Oct. 17, 2017, 1:53 p.m.
For instance id='784', question id='1', both answers were marked as true.
Posted by: zpchen @ Oct. 19, 2017, 8:04 a.m.
Thanks for catching this! I fixed the error and re-uploaded the data. As far as I could see, this was the only flawed instance.
I have a suggestion: the Task 11 data should also report human performance, so that we can see the gap between humans and our models.
Posted by: zpchen @ Oct. 30, 2017, 7:54 a.m.
Yes, good suggestion! We did indeed plan to provide a human upper bound and are currently working on it. The upper bound will be provided at least for the test data.
Posted by: simono @ Nov. 7, 2017, 7:47 a.m.