SemEval-2018 Task 9: Hypernym Discovery

Organized by CamachoCollados




This is the CodaLab Competition for the SemEval-2018 Task 9: Hypernym Discovery.

Google Group:

Main reference paper:

Camacho-Collados, Jose, Delli Bovi, Claudio, Espinosa-Anke, Luis, Oramas, Sergio, Pasini, Tommaso, Santus, Enrico, Shwartz, Vered, Navigli, Roberto, and Saggion, Horacio (2018) 
SemEval-2018 Task 9: Hypernym Discovery,
Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018)


Introduction and Motivation

Hypernymy, i.e. the capability for generalization, lies at the core of human cognition. Unsurprisingly, identifying hypernymic relations has been pursued in NLP for approximately the last two decades, as successfully identifying this lexical relation contributes to improvements in Question Answering applications (Prager et al. 2008; Yahya et al. 2013) and Textual Entailment or Semantic Search systems (Hoffart et al. 2014; Roller and Erk 2016). In addition, hypernymic (is-a) relations are the backbone of almost any ontology, semantic network and taxonomy (Yu et al. 2015; Wang et al. 2017), the latter being a useful resource for downstream tasks such as web retrieval, website navigation or records management (Bordea et al. 2015).

Hypernym Discovery: What is New?

Traditionally, the task of identifying hypernymic relations from text corpora has been evaluated within the broader task of Taxonomy Evaluation (e.g. SemEval-2015 task 17, SemEval-2016 task 13). Alternatively, many approaches have specialized in Hypernym Detection, i.e. the binary task of deciding, given a pair of words, whether a hypernymic relation holds between them or not. This experimental setting has already drawn criticism for its alleged oversimplification (Levy et al. 2015; Shwartz et al. 2017; Camacho-Collados 2017).

Inspired by recent work (Espinosa-Anke et al. 2016), we propose to reformulate the problem as Hypernym Discovery, i.e. given the search space of a domain’s vocabulary, and given an input concept, discover its best (set of) candidate hypernyms. In addition to making the task more realistic in terms of actual downstream applications, this novel approach also opens up complementary evaluation procedures by enabling, for instance, Information Retrieval evaluation metrics (click on the Evaluation tab for detailed information).

In short:

  • General-Purpose Hypernym Discovery in three languages (English, Spanish, Italian)

  • Domain-Specific Hypernym Discovery in two domains (Medicine, Music)

Contact Info:

Jose Camacho-Collados
Claudio Delli Bovi
Tommaso Pasini
Roberto Navigli
Sapienza University of Rome

Vered Shwartz
Bar-Ilan University

Luis Espinosa-Anke
Sergio Oramas
Horacio Saggion
Universitat Pompeu Fabra

Enrico Santus
Singapore University

Contact emails:

- collados [at] di [dot] uniroma1 [dot] it 
- luis.espinosa [at] upf [dot] edu




Georgeta Bordea, Paul Buitelaar, Stefano Faralli, and Roberto Navigli. 2015. Semeval-2015 task 17: Taxonomy extraction evaluation (Texeval). In Proceedings of the SemEval workshop.

Jose Camacho-Collados. 2017. Why we have switched from building full-fledged taxonomies to simply detecting hypernymy relations. arXiv preprint arXiv:1703.04178.

Luis Espinosa-Anke, Jose Camacho-Collados, Claudio Delli Bovi, and Horacio Saggion. 2016. Supervised distributional hypernym discovery via domain adaptation. In Proceedings of EMNLP, pages 424–435.

Johannes Hoffart, Dragan Milchevski, and Gerhard Weikum. 2014. Stics: searching with strings, things, and cats. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, pages 1247–1248.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of NAACL, pages 970–976.

John Prager, Jennifer Chu-Carroll, Eric W Brown, and Krzysztof Czuba. 2008. Question answering by predictive annotation. In Advances in Open Domain Question Answering, Springer, pages 307–347.

Stephen Roller and Katrin Erk. 2016. Relations such as Hypernymy: Identifying and Exploiting Hearst Patterns in Distributional Vectors for Lexical Entailment. In Proceedings of EMNLP, pages 2163–2172.

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of EACL, pages 65–75.

Chengyu Wang, Xiaofeng He, and Aoying Zhou. 2017. A Short Survey on Taxonomy Learning from Text Corpora: Issues, Resources and Recent Advances. In Proceedings of EMNLP, pages 1201–1214.

Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, and Gerhard Weikum. 2013. Robust question answering over the web of linked data. In Proceedings of the 22nd ACM international conference on information & knowledge management. ACM, pages 1107–1116.

Zheng Yu, Haixun Wang, Xuemin Lin, and Min Wang. 2015. Learning Term Embeddings for Hypernymy Identification. In Proceedings of IJCAI, pages 1390–1397.

Task Details

The Hypernym Discovery task consists of, given an input term, finding its most appropriate hypernym(s) in a pre-defined corpus (see Data and Resources for detailed information). This specific task consists of five independent (participants are allowed to submit systems on any individual subtask) but related subtasks, which are split into two larger groups, i.e. general-purpose hypernym discovery and domain-specific hypernym discovery:


Subtask 1: General-Purpose Hypernym Discovery

This subtask consists of discovering hypernyms in a general-purpose corpus. Therefore, in this case systems require the flexibility to provide hypernyms for terms in a wide range of domains. In this task we provide data for three different languages: English (subtask 1A), Italian (subtask 1B) and Spanish (subtask 1C).

Subtask 2: Domain-Specific Hypernym Discovery

In contrast, this subtask deals with specific domains, namely the medical (subtask 2A) and music (subtask 2B) domains. In this case participants test their systems (which may be general or specifically tailored to a target domain) in a much more focused and reduced environment.



The Hypernym Discovery task is especially targeted to evaluate both hypernym extraction and detection systems, as well as taxonomy learning and entity typing systems. Participants from all these areas are encouraged to participate. The task may be additionally viewed as a proxy for downstream applications, such as Information Extraction or Question Answering (e.g. what is the highest mountain in Africa?), that require specific knowledge from is-a relations, as well as a reliable evaluation benchmark for the first step of ontology learning systems, since is-a relations generally constitute their backbone.


Terms and Conditions

By submitting results to this competition, you consent to the public release of your scores at the SemEval-2018 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.

You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.

You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.

You agree not to redistribute the test data except in the manner prescribed by its licence.

Data and Resources


[NEW!] Training and testing data: Testing and training data for all subtasks, including gold hypernyms, vocabularies and an evaluation script.

Information and direct links to download the corresponding corpus for each subtask can be found below. More information about the evaluation and how to participate can be found on the "Evaluation" tab.


For testing, systems are provided with terms for which they have to produce a ranked list of their extracted hypernyms. The gold standard consists of terms along with their corresponding hypernyms (up to trigrams). Training and testing data are split evenly (50% training - 50% testing). More information and direct links to corpora are reported below.


Subtask 1: General-Purpose Hypernym Discovery

General-purpose corpora. For English we use the 3-billion-word UMBC corpus (Han et al. 2013), composed of paragraphs extracted from the web as part of the Stanford WebBase Project. This is a very large corpus containing information from different domains. For Italian we use the 1.3-billion-word itWac corpus (Baroni et al. 2009), extracted from different sources of the web, and for Spanish a 1.8-billion-word corpus (Cardellino 2016), which also contains documents from different sources. Details about the corpora (including direct links for download) are summarized in the table below:

Subtask       Corpus Description                                                                                      Links
1A: English   3B-word UMBC corpus extracted from the web (Han et al. 2013)                                            Original PoS-tagged corpus (temporary link); Tokenized [6.2GB]
1B: Italian   1.3B-word itWac corpus extracted from the web (.it domain) (Baroni et al. 2009)                         Tokenized [2.6GB]
1C: Spanish   1.8B-word corpus extracted from various sources (Wikipedia, Europarl, AnCora, etc.) (Cardellino 2016)   Tokenized [3.2GB]

Input terms. We provide a balanced set of terms, with different degrees of frequency and from different domains. For English, 3000 terms with their corresponding hypernyms are provided (around 10000 term-hypernym pairs), while for Spanish and Italian 2000 terms each are provided.

Gold standard. The gold standard consists of input terms given to the systems (see above) and gold hypernyms extracted from multiple resources and manually validated (for both training and testing). See the table below for some examples.

Dataset             Term          Hypernym(s)                         Source
English (general)   dog           canine, mammal, animal              WordNet
Spanish (general)   guacamole     salsa para mojar, alimento, salsa   Wikidata
Italian (general)   Nina Simone   musicista, pianista, persona        MultiWiBi

Subtask 2: Domain-Specific Hypernym Discovery

Domain-specific corpora. For the medical domain we provide a combination of abstracts and research papers from the MEDLINE (Medical Literature Analysis and Retrieval System) repository, which contains academic documents such as scientific publications and paper abstracts. For the music domain, the provided corpus is a concatenation of several music-specific corpora: music biographies contained in ELMD 2.0 (Oramas et al. 2016), the music branch of Wikipedia, and a corpus of album customer reviews from Amazon (Oramas et al. 2017). Details about the corpora (including direct links for download) are summarized in the table below:

Subtask       Corpus Description                                                                                                                         Links
2A: Medical   130M-word subset extracted from the PubMed corpus of biomedical literature from MEDLINE, distributed by the National Library of Medicine   Tokenized [258MB]
2B: Music     100M-word corpus including Amazon reviews, music biographies and Wikipedia pages about theory and music genres (Oramas et al. 2016)        Tokenized [200MB]

Note on 2A: updated 10 Sept; some duplicated texts have been removed.

Input terms. As in the previous subtask, we provide a balanced set of terms, with different degrees of frequency and from different sub-domains. We provide around 1000 terms for each domain (clinical and music).

Gold standard. In this case, we use the same procedure as in Subtask 1, restricted to the target domain, and additionally draw on domain-specific taxonomies. See the table below for some examples.

Dataset              Term                 Hypernym(s)                                                                     Source
English (clinical)   pulmonary embolism   disorder of pulmonary circulation, trunk arterial embolus, disorder, embolism
English (music)      Green Day            artist, rock band                                                               MusicBrainz


Data Availability and Copyright

All task participants are provided with trial, training and test sets for each of the subtasks. These datasets will be released under the Creative Commons Attribution-ShareAlike 3.0 Unported License. The data are extracted semi-automatically, pre-processed and validated by experts, most of whom are on the organizing team. We intend to use only data that are openly available, so no additional licenses or permissions need to be acquired for the resources as a part of this task.



Marco Baroni, Silvia Bernardini, Adriano Ferraresi and Eros Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3): 209-226.

Cristian Cardellino. 2016. Spanish Billion Words Corpus and Embeddings (March 2016),

Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Johnathan Weese. 2013. UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems. In Proceedings of *SEM.

Sergio Oramas, Luis Espinosa Anke, Mohamed Sordo, Horacio Saggion and Xavier Serra. 2016. ELMD: An Automatically Generated Entity Linking Gold Standard Dataset in the Music Domain. In Proceedings of LREC.

Sergio Oramas, Oriol Nieto, Francesco Barbieri, and Xavier Serra. 2017. Multi-label Music Genre Classification from Audio, Text, and Images Using Deep Features. In Proceedings of the 18th Conference of the International Society of Music Information Retrieval (ISMIR 2017).



Find the corresponding corpus for each subtask in the "Data and Resources" tab (Learn the Details -> Data and Resources).

For each subtask and setting we provide a list of input terms (hyponyms) as well as a large vocabulary extracted from each corpus (see Data and Resources section). Participants are expected to deliver, for each input term, a ranked list of candidate hypernyms (up to 15) from the provided vocabulary.
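
As an illustration of the expected output for a single input term, the following minimal sketch turns a system's scored candidates into a ranked list restricted to the provided vocabulary and truncated to 15 entries. The function name `top_candidates`, the score dictionary and the alphabetical tie-breaking are assumptions made for this example; they are not part of the task materials.

```python
from typing import Dict, Iterable, List


def top_candidates(scores: Dict[str, float],
                   vocabulary: Iterable[str],
                   k: int = 15) -> List[str]:
    """Return at most k candidates from the vocabulary, best score first.

    `scores` maps candidate hypernyms to a system-specific score
    (hypothetical: any model could produce it).
    """
    vocab = set(vocabulary)
    # Keep only candidates that belong to the provided vocabulary.
    allowed = [(cand, s) for cand, s in scores.items() if cand in vocab]
    # Sort by descending score; break ties alphabetically for determinism.
    allowed.sort(key=lambda pair: (-pair[1], pair[0]))
    return [cand for cand, _ in allowed[:k]]


example_scores = {"animal": 0.90, "mammal": 0.95, "spaceship": 0.99, "canine": 0.90}
example_vocab = ["animal", "mammal", "canine"]
ranked = top_candidates(example_scores, example_vocab)
```

Here "spaceship" is dropped because it is not in the example vocabulary, regardless of its score.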

How to submit: compress all your submission files in a .zip archive directly (no intermediate folders) using the file names "1A.english.output.txt", "1B.italian.output.txt", "1C.spanish.output.txt", "2A.medical.output.txt" and "" (or any subset of these, according to the subtasks you are planning to participate in). In each submission file we expect a ranked list of hypernyms (tab-separated) for each line, such that line numbers are aligned with the corresponding "data" file. You can leave a blank line when you do not want to provide hypernyms for a given term.

- Note #1: all subtasks are independent. While we encourage all participants to submit systems on all languages and domains, participants are allowed to submit results on the subtasks they consider appropriate (for example only on subtask 1A or subtask 2B).

- Note #2: input terms and vocabularies include both words and multi-word expressions up to trigrams.

- Note #3: input terms consist of both concepts and named entities (specified with the labels "Concept" and "Entity" respectively). It is possible to participate on one of the two types only (concepts->subclass of relations; entities->instance of relations). If so, please indicate it at submission time. For the submission you can simply leave empty the lines corresponding to named entities or concepts, as appropriate.

- Note #4: even though the input terms format will have case distinctions (i.e. upper- or lowercase), the evaluation is case-insensitive.

- Note #5: participants are allowed to submit a maximum of two systems/runs.
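
The submission layout described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `write_submission` and the archive name `submission.zip` are hypothetical, and the per-term predictions are assumed to be already aligned, line by line, with the corresponding "data" file. Each line holds the tab-separated ranked hypernyms (at most 15), an empty list yields a blank line, and files are stored at the archive root so that no intermediate folder is created.

```python
import zipfile
from pathlib import Path
from typing import Dict, List


def write_submission(predictions: Dict[str, List[List[str]]],
                     out_dir: str,
                     zip_name: str = "submission.zip") -> Path:
    """Write one output file per subtask directly into a .zip archive.

    `predictions` maps an output file name (e.g. "1A.english.output.txt")
    to a list with one ranked hypernym list per input term.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zip_path = out / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name, ranked_lists in predictions.items():
            # One line per input term; an empty list becomes a blank line.
            lines = ["\t".join(ranked[:15]) for ranked in ranked_lists]
            # Writing by bare file name keeps the file at the archive root.
            archive.writestr(file_name, "\n".join(lines) + "\n")
    return zip_path
```

For example, `write_submission({"1A.english.output.txt": [["canine", "mammal"], [], ["food"]]}, "out")` would produce an archive with a single three-line file whose second line is blank.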


Evaluation Criteria

For the evaluation we utilize the following Information Retrieval measures (for all measures only the top 15 extracted hypernyms will be considered):

  • Mean Reciprocal Rank (MRR): the average of the reciprocal ranks of the first correctly retrieved hypernym.  
  • Precision@k (P@k): a classic IR measure which evaluates the correctness of the hypernyms retrieved in the top k positions of the ranked list of candidates. We plan to set k to 1, 3, 5 and 15.
  • Mean Average Precision @15 (MAP): a complementary metric to MRR that takes into account whether hypernyms were retrieved within the k first positions in all Precision@k measures (1<=k<=15). This will be the measure displayed on the leaderboard, but please note that all the measures are complementary and equally valid depending on the purpose.
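
To make the three measures concrete, here is a minimal sketch using their standard textbook definitions. It is not the official evaluation script distributed with the data, whose normalization details may differ. The function names are chosen for this example, matching is case-insensitive as stated in Note #4, only the top 15 candidates are considered, and each ranked list is assumed to contain no duplicates.

```python
from typing import Dict, List, Sequence

MAX_K = 15  # only the top 15 extracted hypernyms are considered


def _hits(predicted: Sequence[str], gold: Sequence[str]) -> List[bool]:
    """Mark each of the top-15 candidates as correct or not (case-insensitive)."""
    gold_lc = {g.lower() for g in gold}
    return [p.lower() in gold_lc for p in predicted[:MAX_K]]


def reciprocal_rank(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """1 / rank of the first correct hypernym, or 0 if none is retrieved."""
    for rank, hit in enumerate(_hits(predicted, gold), start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def precision_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> float:
    """Fraction of correct hypernyms in the top k; missing positions count as wrong."""
    return sum(_hits(predicted, gold)[:k]) / k


def average_precision(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Sum of P@rank at each correct rank, normalized by min(|gold|, 15)."""
    if not gold:
        return 0.0
    correct, total = 0, 0.0
    for rank, hit in enumerate(_hits(predicted, gold), start=1):
        if hit:
            correct += 1
            total += correct / rank
    return total / min(len(gold), MAX_K)


def evaluate(predictions: List[List[str]], gold_lists: List[List[str]]) -> Dict[str, float]:
    """Average each measure over all input terms."""
    n = len(gold_lists)
    pairs = list(zip(predictions, gold_lists))
    return {
        "MRR": sum(reciprocal_rank(p, g) for p, g in pairs) / n,
        "MAP": sum(average_precision(p, g) for p, g in pairs) / n,
        "P@1": sum(precision_at_k(p, g, 1) for p, g in pairs) / n,
    }
```

With gold hypernyms canine, mammal and animal and the ranked output pet, mammal, animal, thing, the first correct hypernym is at rank 2, so the reciprocal rank is 1/2, P@3 is 2/3, and the average precision is (1/2 + 2/3) / 3 = 7/18.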

Manual evaluation of false positives

Following Bordea et al. (2015), where an additional manual evaluation was carried out on novel relations (i.e. edges encoded on novel terminology not part of the initial task data) in different domains, this task will also feature a manual evaluation over a sample of theoretical false positives.


We plan to release simple unsupervised and supervised baselines to be used as reference for each kind of model.

Cross-evaluation for supervised systems (optional)

We kindly encourage participants submitting supervised systems in subtask 1A (English general) to also send us the results of their systems trained on the data of subtask 1A but tested on the domain-specific data (subtasks 2A and 2B). These results will not be considered for the official rankings but will be analyzed and will help us understand how well systems trained on general-purpose data perform on different domains. These results should be sent by email (titled "SemEval 2018: Cross-evaluation") to the organizers, containing two files with the output on the test data of subtasks 2A and 2B, respectively.




The official results of the SemEval 2018 task on Hypernym Discovery can be downloaded here:
We have split the results by dataset in tsv (tab-separated) files, where rows correspond to systems' runs. Each system is represented by its team name or CodaLab username (when the team name was not available). Systems are sorted by MAP percentage, but all other measures (MRR and P@K) are also displayed.


  • Please note that all the results have been released together (the general results and the results split by concept/entity). The nature of each system is indicated in the second and third columns (unsupervised/supervised, and external resources used); this information will be used for the analysis and for ranking each kind of system. The systems ending in "CROSSEVAL" in the music and medical datasets correspond to those supervised systems which were trained on the English training data (subtask 1A).
  • We have included a set of strong supervised and unsupervised baselines as a reference. As supervised baseline we have included a vanilla version of TaxoEmbed (Espinosa-Anke et al. 2016), using Word2Vec word embeddings trained on the provided corpora and training a transformation matrix on the training data, with the corresponding word embeddings as features. We have also included a baseline based on the Most Frequent Hypernyms (MFH) in the training data. As unsupervised baselines we used three hypernymy detection techniques described in Shwartz et al. (2017), adapted to this task.
  • As mentioned in the task description, we have performed a manual evaluation of false positives. For each system, we extracted 50 random errors among the first hypernyms returned (50% concepts - 50% entities), which were then evaluated by human annotators. These results are displayed in the last column of each file and correspond to the percentage of correctly retrieved hypernyms in the extracted random sample.
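
The Most Frequent Hypernyms (MFH) baseline mentioned above can be sketched roughly as follows. The page does not describe its exact implementation, so this is an assumption-laden illustration: the function name is hypothetical, and the sketch simply returns the same list of the 15 most frequent training hypernyms for every input term.

```python
from collections import Counter
from typing import List


def most_frequent_hypernyms(train_gold: List[List[str]], k: int = 15) -> List[str]:
    """Return the k hypernyms that occur most often in the training gold data.

    `train_gold` holds one list of gold hypernyms per training term.
    """
    counts = Counter(h for hypernyms in train_gold for h in hypernyms)
    # most_common sorts by descending count.
    return [hypernym for hypernym, _ in counts.most_common(k)]


toy_train_gold = [
    ["animal", "mammal"],
    ["animal", "person"],
    ["person", "animal"],
]
mfh_output = most_frequent_hypernyms(toy_train_gold, k=2)
```

On this toy data, "animal" occurs three times and "person" twice, so those two would be proposed for every test term.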


Task Description Reference:

  title={{SemEval-2018 Task 9: Hypernym Discovery}},
  author={Camacho-Collados, Jose and Delli Bovi, Claudio and Espinosa-Anke, Luis and Oramas, Sergio and Pasini, Tommaso and Santus, Enrico and Shwartz, Vered and Navigli, Roberto and Saggion, Horacio},
  booktitle={Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018)},
  address={New Orleans, LA, United States},
  publisher={Association for Computational Linguistics}




Start: Jan. 8, 2018, midnight


Start: Jan. 12, 2018, midnight


Start: Jan. 30, 2018, midnight

Competition Ends

