SemEval-2019 Task 12 - Toponym Resolution in Scientific Papers Forum


> Questions about the data and the baseline code

We have found some mistakes in the data. Some entities in the annotation files are not consistent with the words in the raw text: some words are only partially marked as an entity. For example, "RSA" in the word "RSAa" (PMC2732512.txt 12969 12973), "Cologne" in the word "Cologne's" (PMC2828076.txt 6408 6417), "Denmark" in the word "Denmark13" (PMC3820476.txt 3622 3631), and so on. Maybe these annotations are reasonable, because the annotated parts are genuine toponyms. However, toponym detection models are token based: they tag a whole word as a toponym, not part of a word. For example, a model may tag the word "Denmark13" as a toponym but not "Denmark". These cases are treated as errors by the official evaluation script (ToponymEvaluator.py). So do you have a plan to improve the data quality? (Maybe there is not enough time.)
Besides, I'm afraid the evaluation script is not perfect. The function __getExtendedText__ only inserts "\n" between the fragments of a multi-span annotation. For instance, "Ham-\nburg" (PMC4907427, 41948 41957) is annotated as "41948 41951;41953 41957", with the hyphen skipped. The function therefore returns the string "Ham\nburg", which mismatches the raw text and causes an error. So may I ask you to review the script?
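For reference, here is a rough sketch (not the official evaluator) of how the surface text of such a discontinuous annotation can be rebuilt directly from the character offsets in the raw .txt; the file name and offsets are just the example above:

```python
# Rough sketch, not the official evaluator: rebuild the surface text of a
# discontinuous Brat annotation from its character offsets in the raw .txt.
def spans_to_text(raw_text, span_field):
    # span_field looks like "41948 41951;41953 41957"
    fragments = []
    for part in span_field.split(";"):
        start, end = map(int, part.split())
        fragments.append(raw_text[start:end])
    return fragments

with open("PMC4907427.txt", encoding="utf-8") as f:
    raw = f.read()

# The raw text between the two fragments contains "-\n"; joining the
# fragments with "\n", as __getExtendedText__ does, therefore produces
# "Ham\nburg", which no longer matches the raw text "Ham-\nburg".
print(spans_to_text(raw, "41948 41951;41953 41957"))
```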

Posted by: VincentZheng @ Jan. 4, 2019, 12:31 a.m.

Dear Vincent,

Regarding your first question about the data: after discussion with our annotators, cases such as "Denmark" in the token "Denmark13" (PMC3820476.txt 3622 3631) are not considered annotation errors, since the expected toponym is Denmark and not Denmark13; the same applies to your other examples. A clever system should be able to return the position of Denmark inside Denmark13. The problem here is rather to build a tokenizer able to handle those cases by creating two tokens, one for Denmark and one for 13. The default tokenizer based on whitespace separation will not work well with scientific articles, and even less so with PDF conversions. That said, I am not sure whether all annotators annotated all nested toponyms in the same way; I am still investigating the problem and will update my answer.
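As an illustration only (this is not the task baseline), a tokenizer that splits alphabetic and non-alphabetic runs while keeping character offsets would produce a token whose span can match the gold annotation:

```python
import re

# Illustrative sketch: split each whitespace-delimited token into alphabetic
# and non-alphabetic runs, keeping character offsets, so that "Denmark13"
# yields a token "Denmark" whose span can match the gold annotation.
def tokenize(text):
    for token in re.finditer(r"\S+", text):
        for run in re.finditer(r"[A-Za-z]+|[^A-Za-z\s]+", token.group()):
            yield run.group(), token.start() + run.start(), token.start() + run.end()

print(list(tokenize("viruses were isolated in Denmark13")))
# [('viruses', 0, 7), ('were', 8, 12), ('isolated', 13, 21),
#  ('in', 22, 24), ('Denmark', 25, 32), ('13', 32, 34)]
```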

Regarding your second question, thanks for noticing the problem. The bug is actually not in the evaluation script but most likely in the Brat annotator. In some cases, not all, hyphens before a new line are removed in the texts associated with the annotations. The positions remain correct but the contents of the strings are not. I have added a few lines of code to the evaluation script to handle these cases. As you have seen, the evaluation script simply compares the positions of the gold and predicted annotations; if the positions are identical, it proceeds to a sanity comparison between the texts of the gold and predicted annotations, which should be identical (since their positions are the same). If they are not, all spaces, new lines and hyphens are removed and the strings are compared again (Brat sometimes seems to replace newlines with spaces when it rewrites the texts of the annotations). I checked that a system guessing all positions correctly and rewriting the texts from the .txt at these positions into the generated .ann will get a perfect score on the test set (i.e. rewriting the gold annotations with the texts extracted from the converted PDF at those positions, not with the texts rewritten by Brat, then running the evaluation script against these "predicted" annotations). I have updated the evaluation script on CodaLab and in my Bitbucket if you want to try the new version.
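To make the comparison explicit, it boils down to something like the following simplified sketch (not the exact code of ToponymEvaluator.py):

```python
import re

# Simplified sketch of the lenient text comparison described above: compare
# the two strings directly and, on mismatch, strip spaces, newlines and
# hyphens (Brat sometimes drops a hyphen before a line break or turns a
# newline into a space) before comparing again. Positions are assumed to
# have matched already.
def same_surface(gold_text, predicted_text):
    if gold_text == predicted_text:
        return True
    normalize = lambda s: re.sub(r"[\s\-]+", "", s)
    return normalize(gold_text) == normalize(predicted_text)

assert same_surface("Ham\nburg", "Ham-\nburg")  # hyphen removed by Brat
assert same_surface("New York", "New\nYork")    # newline rewritten as space
assert not same_surface("Denmark", "Germany")
```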

I hope this helps.

Best regards,
Davy

Posted by: dweissen @ Jan. 7, 2019, 9:14 p.m.

Hi Davy, thank you for replying. I have some more questions about the gold standard annotations of the development set.
Below I've listed a sample of cases where it is not clear why certain locations were annotated and others were not.

1. Why is "Northern Hemisphere" annotated as "LOC" in PUB20167132.txt but not annotated in PMC2732512.txt?
Filename: PUB20167132.txt
Context: Previous studies revealed high virus prevalence during the autumn season in the Northern Hemisphere [16]
Filename: PMC2732512.txt
Context: Although bird deaths characterized recent epidemics in the Northern Hemisphere, the disease appears to have spared Afri-can species in a New York zoo (19,20,40).

2. Why does "the Americas" have three kinds of annotations?
In "IOB" format, "the Americas" is annotated as "O B-LOC" in PMC4907427.txt, as "O O" in PMC4479511.txt, but as "B-LOC I-LOC" in PUB21392430.txt.
Filename: PMC4907427.txt
Context: Related HCPS-causing hantaviruses are found throughout the Americas.
Filename: PMC4479511.txt
Context: In the Americas, they are responsible for hantavi-rus cardiopulmonary syndrome with relatively high case fatality rate in humans.
Filename: PUB21392430.txt
Context: Serologic studies for swine influenza viruses (SIVs) in humans with occupational exposure to swine have been reported from the Americas but not from Europe.

3. In PUB20975994.txt, "Southern Finland", "Eastern Finland" and "Western Finland" are annotated as "B-LOC I-LOC", but "Northern Finland" is annotated as "O B-LOC". And why aren't the abbreviations of "Southern Finland" (SF), "Eastern Finland" (EF) and "Western Finland" (WF) annotated as "LOC"?
Filename: PUB20975994.txt
Context: In addition, the following abbreviations are used: SF - Southern Finland, EF - Eastern Finland, WF - Western Finland, NF - Northern Finland (including Oulu and Lapland districts).

Similarly, "Western Europe" is annotated as "LOC" in PUB21392430.txt but not in PMC4009295.txt, and "Southeast Asia" is annotated as "LOC" in PUB20167132.txt but not in PMC3952587.txt.

4. Why are "Kansas State University's biosafety level 2 (BSL-2) facility" and "BSL-3 facility" annotated as "LOC", given that names of organizations should not be tagged according to the annotation guidelines?
Filename: PUB21900171.txt
Context: These studies included two experiments: the classical H1N1 SIV (IA30) study was completed at Kansas State University's biosafety level 2 (BSL-2) facility in compliance with the Institutional Animal Care and Use Committee at Kansas State University, and the pH1N1 virus study was completed at the Central States Research Center (CSRC), Inc., BSL-3 facility (Oakland, NE), in compliance with the InstitutionalAnimal Care and Use Committee at CSRC.

5. In document "23029335.txt", some countries' full names are tagged while their abbreviations are not.
Filename: 23029335.txt
Context: TW: Taiwan; IND: India; RUS: Russia; CAN: Canada."

6. In document "23029335.txt", in "May-September 2009 for India, and June-September 2009 for Taiwan", the reference annotations tag "Taiwan" as "Location" while its counterpart "India" in the same sentence is not tagged.
Filename: 23029335.txt
Context: Despite approximately the same time-frame between the two studies (May-September 2009 for India, and June-September 2009 for Taiwan), only six mutations were found in common between 16 mutations detected from 13 Indian viruses and 26 mutations from 39 Taiwanese viruses (shown in Table 7)

Since there are some annotation errors in the training and evaluation corpora, I want to make sure: will you correct such annotation mistakes in the test corpus?

Posted by: VincentZheng @ Jan. 8, 2019, 11:22 a.m.

Dear Vincent,

Update on the question regarding the nested toponyms (e.g. Denmark13): we have checked that all nested toponyms are annotated in the same way in the test set; that is, for all instances such as Denmark13, RSAa, etc., only the nested toponym is annotated (in our examples, Denmark and RSA are both annotated as toponyms).

Regarding the other questions, please see the answers given by our annotators inline:
1. Why is "Northern Hemisphere" annotated as "LOC" in PUB20167132.txt but not annotated in PMC2732512.txt?
Filename: PUB20167132.txt
Context: Previous studies revealed high virus prevalence during the autumn season in the Northern Hemisphere [16]
Filename: PMC2732512.txt
Context: Although bird deaths characterized recent epidemics in the Northern Hemisphere, the disease appears to have spared Afri-can species in a New York zoo (19,20,40).
-> It is a FN for PMC2732512; it should be annotated.

2. Why does "the Americas" have three kinds of annotations?
In "IOB" format, "the Americas" is annotated as "O B-LOC" in PMC4907427.txt, as "O O" in PMC4479511.txt, but as "B-LOC I-LOC" in PUB21392430.txt.
Filename: PMC4907427.txt
Context: Related HCPS-causing hantaviruses are found throughout the Americas.
Filename: PMC4479511.txt
Context: In the Americas, they are responsible for hantavi-rus cardiopulmonary syndrome with relatively high case fatality rate in humans.
Filename: PUB21392430.txt
Context: Serologic studies for swine influenza viruses (SIVs) in humans with occupational exposure to swine have been reported from the Americas but not from Europe.
-> Following our guidelines, the correct sequence is "O B-Loc": only Americas is a toponym, and the definite article should be ignored. We have here one FN ("O O") and one incorrect annotation ("B-Loc I-Loc").

3. In PUB20975994.txt, "Southern Finland", "Eastern Finland" and "Western Finland" are annotated as "B-LOC I-LOC", but "Northern Finland" is annotated as "O B-LOC". And why aren't the abbreviations of "Southern Finland" (SF), "Eastern Finland" (EF) and "Western Finland" (WF) annotated as "LOC"?
Filename: PUB20975994.txt
Context: In addition, the following abbreviations are used: SF - Southern Finland, EF - Eastern Finland, WF - Western Finland, NF - Northern Finland (including Oulu and Lapland districts).
Similarly, "Western Europe" is annotated as "LOC" in PUB21392430.txt but not in PMC4009295.txt, and "Southeast Asia" is annotated as "LOC" in PUB20167132.txt but not in PMC3952587.txt.
-> These annotations are correct but need some explanation, since they do not seem intuitive. Following our guidelines, adjectives are excluded from the annotation: e.g. south China should be annotated "O B-Loc", but not when the adjective is part of the toponym, e.g. North Carolina, annotated "B-Loc I-Loc". The problem starts with what is accepted as "part of the toponym": one may argue that Western Europe as well as Southeast Asia are well-accepted names and should be considered toponyms, and therefore both annotated as "B-Loc I-Loc". To avoid the problem we made the following choice: if the sequence is in GeoNames, it is accepted as "part of the toponym"; if not, the usage is not well accepted and the adjective should be ignored. For the case of Finland, "Southern Finland", "Eastern Finland" and "Western Finland" occur in GeoNames but Northern Finland does not, therefore the annotations are coherent with the guidelines (see the sketch below).
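To make the rule concrete, the decision amounts to a simple gazetteer lookup; the sketch below is only illustrative, and geonames_names is a hypothetical, hand-filled stand-in for names loaded from a GeoNames dump:

```python
# Illustrative sketch of the guideline above: the directional adjective is
# kept inside the annotation only when the full sequence exists in GeoNames.
# `geonames_names` is a hypothetical placeholder for a real GeoNames gazetteer.
geonames_names = {"southern finland", "eastern finland", "western finland",
                  "north carolina"}

def annotate(adjective, head):
    if f"{adjective} {head}".lower() in geonames_names:
        return [(adjective, "B-LOC"), (head, "I-LOC")]  # part of the toponym
    return [(adjective, "O"), (head, "B-LOC")]          # adjective ignored

print(annotate("Southern", "Finland"))  # [('Southern', 'B-LOC'), ('Finland', 'I-LOC')]
print(annotate("Northern", "Finland"))  # [('Northern', 'O'), ('Finland', 'B-LOC')]
```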

4. Why are "Kansas State University's biosafety level 2 (BSL-2) facility" and "BSL-3 facility" annotated as "LOC", given that names of organizations should not be tagged according to the annotation guidelines?
Filename: PUB21900171.txt
Context: These studies included two experiments: the classical H1N1 SIV (IA30) study was completed at Kansas State University's biosafety level 2 (BSL-2) facility in compliance with the Institutional Animal Care and Use Committee at Kansas State University, and the pH1N1 virus study was completed at the Central States Research Center (CSRC), Inc., BSL-3 facility (Oakland, NE), in compliance with the InstitutionalAnimal Care and Use Committee at CSRC.
-> These are other difficult cases to define and were subject to discussion between annotators. One annotator may argue that (1) it is the name of an institution and not the name of a place, so it should not be annotated; another may answer that (2) in this specific instance the physical place is actually mentioned, making the expression more of a toponym than the name of an institution. The guidelines have evolved between the first set of articles released in 2015 and the new articles annotated for SemEval'19. In the first version of the guidelines these cases were handled by rule (2), whereas in the new batch rule (1) was applied. For the test set, rule (1) was applied, and your examples would be considered FPs. Such cases are rare in the corpus.

5. In document "23029335.txt", some countries' full names are tagged while their abbreviations are not.
Filename: 23029335.txt
Context: TW: Taiwan; IND: India; RUS: Russia; CAN: Canada."
-> See number 3.

6. In document "23029335.txt", in "May-September 2009 for India, and June-September 2009 for Taiwan", the reference annotations tag "Taiwan" as "Location" while its counterpart "India" in the same sentence is not tagged.
Filename: 23029335.txt
Context: Despite approximately the same time-frame between the two studies (May-September 2009 for India, and June-September 2009 for Taiwan), only six mutations were found in common between 16 mutations detected from 13 Indian viruses and 26 mutations from 39 Taiwanese viruses (shown in Table 7)
-> This is a FN; India should be annotated as a location.

Hope this helps.

Best regards,
Davy

Posted by: dweissen @ Jan. 9, 2019, 6:03 p.m.

Thanks, Davy. So have the above FN annotation errors been corrected, or will they be corrected, in the test corpus?

Posted by: VincentZheng @ Jan. 10, 2019, 7:14 a.m.

The test corpus has been annotated independently by two annotators and compared to ensure coherence with the guidelines and to reduce human errors. We also ran several automatic sanity checks on the test data and corrected the anomalies found. This should increase its quality, but unfortunately no corpus is without errors. On the other hand, any remaining errors will impact all participating systems in the same way.

Best regards,
Davy

Posted by: dweissen @ Jan. 10, 2019, 4:31 p.m.