SemEval 2018 Task 4: Character Identification on Multiparty Dialogues

> data issues

During initial evaluation of the data, we found some fairly significant issues with missing tags in the training data. Several POS tags do contain tagged instances, but those same tags also have many instances that we believe should have been tagged and are not. PRP$ is a good example: there are 2317 total tokens tagged as PRP$, of which 1331 have an entity ID and 986 do not. The two sets share the words {His, your, Her, her, My, yours, his, my, Your}. We believe many of the instances without an entity ID should have one. The other POS tags that share this issue are NN, PRP and NNP. There are others, but we assume the auto-generated CoNLL file will have some measure of error.
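For anyone who wants to reproduce counts like these, here is a minimal sketch of the kind of tally described above. It is not the exact script behind the numbers in this post. It assumes a whitespace-separated CoNLL-style file with the word form in column 3, the POS tag in column 4, and the entity ID in the last column, where "-" means no entity ID; those column positions and the function names `count_entity_coverage` and `shared_words` are assumptions for illustration, so adjust them to the actual file layout.

```python
from collections import Counter, defaultdict


def count_entity_coverage(lines, word_col=3, pos_col=4, entity_col=-1, empty="-"):
    """Tally, per POS tag, how many tokens have or lack an entity ID.

    Column indices are assumptions about the file layout, not a
    confirmed description of the task data.
    """
    counts = defaultdict(Counter)
    words = defaultdict(lambda: {"tagged": set(), "untagged": set()})
    for line in lines:
        line = line.strip()
        # Skip blank lines and "#begin document" / "#end document" markers.
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) <= max(word_col, pos_col):
            continue  # malformed row, too few columns
        pos = cols[pos_col]
        key = "untagged" if cols[entity_col] == empty else "tagged"
        counts[pos][key] += 1
        words[pos][key].add(cols[word_col])
    return counts, words


def shared_words(words, pos):
    """Word forms that appear both with and without an entity ID."""
    return words[pos]["tagged"] & words[pos]["untagged"]
```

Calling `count_entity_coverage(open(path))` on the training file and then `shared_words(words, "PRP$")` should give the overlap set for PRP$, provided the column assumptions hold for the file in question.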

Posted by: casey.b @ Nov. 15, 2017, 2:33 a.m.

Sorry for the late reply; posts in the forum had not been forwarded to us. The POS tag errors are due to automatic tagging, as you assumed; we tried to fix the errors manually as much as possible, but we have to admit that some errors remain. After this competition, we will make this data publicly available so the community can help us fix these errors. Thanks for bringing this up.



Posted by: jdchoi @ Jan. 3, 2018, 2:55 a.m.
