In the GitHub project, the file semeval2018-task12-master/data/train/train-full.txt contains several errors.
One of my issues is with the assessment of arguments. Below is an example of an argument that doesn't make sense to me.
debateTitle: Unpaid intenship exploit college students
debateInfo: Should the government get tough to protect unpaid interns, or are internships a win-win?
reason: Interns are replacing employees.
claim: Do Unpaid Internships Exploit College Students?
warrant0: that harms the company's bottom line
warrant1: that helps the company's bottom line
correct warrant: 0
First of all, the question has nothing to do with the company's bottom line, and other entries in the test data do the same thing (one example: the warrants are about how bike lanes in New York aren't working, and then the question is whether it is bad or good for New York to have tourists at attractions). Are these warrants intentional? Are we meant to get warrants that are not mentioned or hinted at in any of the other information, and to infer the answer using outside knowledge about each subject?
If I overlook the partial disconnect between the warrants and the other information, it seems to me that having interns (free labor) would help the company's bottom line, not hurt it as the correct answer suggests. Am I wrong in thinking that this label is incorrect, or what am I missing? I thought this might be the only one like it, but I've found several others where I do not agree with the chosen correct answer.
Also, as a side note, there are misspellings all over the Task 12 test data, specifically with the letter t: in the debate title above there is no t in internship, and in another entry there was no t in student.
Hi mill5970,
Thanks for your comment!
First off, apologies for my late reply; I didn't see your post until now. I thought CodaLab sent notifications about new posts in this forum, but it doesn't (see https://github.com/codalab/codalab-competitions/issues/1996 ).
You perhaps mixed up the "debateTitle" and "claim" in the example; it should read:
> debateTitle: Do Unpaid Internships Exploit College Students?
debateInfo: Should the government get tough to protect unpaid interns, or are internships a win-win?
reason: Interns are replacing employees.
> claim: Unpaid intenship exploit college students
warrant0: that harms the company's bottom line
warrant1: that helps the company's bottom line
(see the column head in the first row of the file).
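The column head can also be used programmatically, so that fields such as debateTitle and claim are looked up by name rather than by position. The following is a minimal sketch, assuming the file is tab-separated and that its first row holds the column names (possibly prefixed with #). The function name read_instances, the sample row, its ID x1, and its column order are made up for illustration and are not taken from the actual file.

```python
import csv
import io


def read_instances(handle, delimiter="\t"):
    """Yield one dict per data row, keyed by the names in the header row."""
    # QUOTE_NONE: forum text may contain stray double quotes that should
    # be kept as ordinary characters.
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    # Assumption: the header row may start with '#', which is stripped here.
    header = [name.lstrip("#").strip() for name in next(reader)]
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(header, row))


# Tiny in-memory stand-in for the real file (hypothetical ID and column order).
sample = (
    "#id\twarrant0\twarrant1\tclaim\tdebateTitle\n"
    "x1\tthat harms the company's bottom line\t"
    "that helps the company's bottom line\t"
    "Unpaid intenship exploit college students\t"
    "Do Unpaid Internships Exploit College Students?\n"
)

instances = list(read_instances(io.StringIO(sample)))
for name, value in instances[0].items():
    print(f"{name}: {value}")
```

For the real data, the same function could be given a file handle instead, for example open("semeval2018-task12-master/data/train/train-full.txt", encoding="utf-8", newline=""), provided the assumptions above hold for that file.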
Regarding this particular case: I'm not familiar with all of the data, as I've mostly manually checked the dev and test data. But I struggle with this one too.
I think the right answer should be warrant1 (not warrant0), because
- helping to increase a company's total profit for free is exploitation [warrant1]
- harming a company's total profit (for free) is not exploitation [warrant0]
as you also suggest.
I hope this is an exception; if you find more, please share them with us. Although the crowdsourcing process involved many quality checks, errors are obviously inevitable. I'll fix the data accordingly.
Regarding the typos: you're right, there are a few typos here and there, but we keep them - the data originate from discussion forums, where typos are quite frequent.
Hope it helps,
Ivan
Posted by: ivan.habernal @ Oct. 9, 2017, 12:27 p.m.
Hello Ivan,
looking at the training data, I'm seeing some possible non-accidental mistakes:
8386236_62_AE861G0AY5RGT reason > "Day can has cost me a lot of money for my three children." => "Day can" should be "daycare"
3649706_204_A104V8NZIQFN2F warrant0 > "you can't skype or chat online for a god source of human interaction" => "god source" should be "good source"
2566415_0_A34QZDSTKZ3JO9 warrant0/warrant1 > "workers will only speek freely on their own time" => "speek" should be "speak"
Also for 2566415_0_A34QZDSTKZ3JO9 we have warrants:
w0 > workers will only speek freely on their own time
w1 > workers can only speek freely on their own time
I don't quite get the difference between the two warrants in the example above. The same goes for 12017128_84_AB98SGS280TY5:
w0> Teachers need job regulation
w1> Teachers need job regulation instead of tenure.
4594455_0_A1HKYY6XI2OHO1 > "Schoold day should not be longer" => "Schoold" should be "school"
I've noticed that "Schoold" actually appears 8 times in the train dataset. My knowledge of US affairs is limited, so I just want to check whether this is a typo or a known term.
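A count like this can be reproduced with a short script. The sketch below assumes a whole-word, case-sensitive match is what is wanted; the helper name count_token and the three sample lines are made up for illustration (only the first sample sentence appears in the examples above).

```python
import re


def count_token(lines, token):
    """Return (total occurrences, 1-based line numbers that contain the token).

    Matching is case-sensitive and whole-word, so searching for "School"
    would not count "Schoold".
    """
    pattern = re.compile(r"\b" + re.escape(token) + r"\b")
    total = 0
    hits = []
    for number, line in enumerate(lines, start=1):
        found = len(pattern.findall(line))
        if found:
            total += found
            hits.append(number)
    return total, hits


# Made-up stand-in for lines of the training file.
sample_lines = [
    "Schoold day should not be longer",
    "School day should be longer",
    "Schoold day should not be longer",
]

total, hits = count_token(sample_lines, "Schoold")
print(total, hits)
```

To run it on the real data, the list can be replaced by an open file handle, since iterating over a file yields its lines.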
Can you please take a look at these examples as well?
Apologies for the false alarm if I failed to comprehend some of the topics,
Thanks,
Filip
Hi Filip,
Thanks for catching these! Sorry for taking so long; I wanted to make sure the typos are out, so I went through all the train/dev data and removed them (wherever appropriate; a few remain, mostly in proper names). It's fixed and pushed to GitHub.
Those two instances were also quite odd; I fixed them too.
Best,
Ivan
Posted by: ivan.habernal @ Nov. 15, 2017, 8:56 a.m.