The CODWOE shared task invites you to compare two types of semantic descriptions: dictionary glosses and word embedding representations. Are these two types of representation equivalent? Can we generate one from the other? To study this question, we propose two subtracks: a definition modeling track (Noraset et al., 2017), where participants have to generate glosses from vectors, and a reverse dictionary track (Hill et al., 2016, a.o.), where participants have to generate vectors from glosses.
Dictionaries contain definitions, such as Merriam-Webster's:
cod: any of various bottom-dwelling fishes (family Gadidae, the cod family) that usually occur in cold marine waters and often have barbels and three dorsal fins
The task of definition modeling consists in using the vector representation of co⃗d to produce the associated gloss, "any of various bottom-dwelling fishes (family Gadidae, the cod family) that usually occur in cold marine waters and often have barbels and three dorsal fins". The reverse dictionary task is the mathematical inverse: reconstruct an embedding co⃗d from the corresponding gloss.
These two tracks display a number of interesting characteristics. They are obviously useful for explainable AI, since they involve converting human-readable data into machine-readable data and back. They also have theoretical significance: glosses and word embeddings are both non-formal representations of meaning, so the tasks amount to converting between two distinct types of semantic representation. From a practical point of view, the ability to infer word embeddings from dictionary resources, or dictionaries from large unannotated corpora, would prove a boon for many under-resourced languages.
Here is an overview of the official results of the competition! More information can be found on the associated GitHub repository.
Definition Modeling track
| user / team   | Rank EN | Rank ES | Rank FR | Rank IT | Rank RU |
|---------------|---------|---------|---------|---------|---------|
| Locchi        | 8       | 6       | 7       |         |         |
| WENGSYX       | 9       | 7       | 6       | 6       | 6       |
| cunliang.kong | 3       | 2       | 3       | 1       | 2       |
| IRB-NLP       | 2       | 1       | 1       | 5       | 5       |
| emukans       | 5       | 4       | 4       | 4       | 3       |
| guntis        | 6       |         |         |         |         |
| lukechan1231  | 7       | 5       | 5       | 3       | 4       |
| pzchen        | 4       | 3       | 2       | 2       | 1       |
| talent404     | 1       |         |         |         |         |
Reverse Dictionary track, SGNS
| user / team      | Rank EN | Rank ES | Rank FR | Rank IT | Rank RU |
|------------------|---------|---------|---------|---------|---------|
| Locchi           | 4       | 4       |         |         |         |
| Nihed_Bendahman_ | 5       | 5       | 4       | 6       | 4       |
| WENGSYX          | 1       | 2       | 2       | 3       | 1       |
| MMG              | 3       |         |         |         |         |
| chlrbgus321      | N/A     |         |         |         |         |
| IRB-NLP          | 3       | 1       | 1       | 1       | 2       |
| pzchen           | 2       | 4       | 3       | 2       | 3       |
| the0ne           | 7       |         |         |         |         |
| tthhanh          | 8       | 7       | 6       | 7       | 6       |
| zhwa3087         | 6       | 6       | 5       | 5       | 5       |
Reverse Dictionary track, electra
| user / team      | Rank EN | Rank FR | Rank RU |
|------------------|---------|---------|---------|
| Locchi           | 3       |         |         |
| Nihed_Bendahman_ | 2       | 2       | 4       |
| WENGSYX          | 4       | 4       | 2       |
| IRB-NLP          | 5       | 3       | 3       |
| pzchen           | 1       | 1       | 1       |
| the0ne           | 6       |         |         |
Reverse Dictionary track, char
| user / team      | Rank EN | Rank ES | Rank FR | Rank IT | Rank RU |
|------------------|---------|---------|---------|---------|---------|
| Locchi           | 1       | 4       |         |         |         |
| Nihed_Bendahman_ | 2       | 2       | 2       | 3       | 4       |
| WENGSYX          | 7       | 5       | 5       | 6       | 5       |
| IRB-NLP          | 4       | 3       | 4       | 2       | 2       |
| pzchen           | 3       | 1       | 1       | 1       | 1       |
| the0ne           | 5       |         |         |         |         |
| zhwa3087         | 6       | 4       | 3       | 5       | 3       |
The data can be retrieved from our dedicated web page. See the related CodaLab page for more details as well.
To help participants get started, we provide a basic architecture for both tracks, a submission format checker, and the scoring script. All of this is available in our public git repository.
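If you want to inspect the data before plugging it into the baseline architecture, a dataset file can be loaded with a few lines of Python. This is only a sketch: the file name below is a hypothetical placeholder, but the fields ("id", "gloss", and the "sgns"/"char"/"electra" embeddings) follow the submission formats described further down this page.

```python
import json

# Hypothetical file name; the archives on the data page may use different names.
DATASET_PATH = "en.train.json"

with open(DATASET_PATH, encoding="utf-8") as fh:
    entries = json.load(fh)

# Each entry pairs a gloss with one or more embedding types
# ("sgns", "char", and, for some languages, "electra").
first = entries[0]
print(first["id"])         # identifier used to match predictions with references
print(first["gloss"])      # the dictionary definition
print(len(first["sgns"]))  # dimensionality of the skip-gram embedding
```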
Keep in mind that we do not allow external data! The point is to keep results linguistically significant and easily comparable. For all details on how we will evaluate submissions, check the relevant CodaLab page.
Rather than focusing strictly on getting the highest scores on a benchmark, we encourage participants to approach this shared task as a collaborative research question: how should we compare two vastly different types of semantic representations such as dictionaries and word embeddings? What caveats are there? In fact, we already have a few questions we look forward to studying at the end of this shared task:
These are but a few questions that we are interested in—do come up with your own to test during this shared task! To encourage participants to adopt this mindset, here are a few key elements of this shared task:
As is usual for SemEval tasks, we will release all data at the end of the shared task. Depending on participants’ consent, we also plan to collect the productions of all models and reuse them in a future evaluation campaign.
Here are the key dates participants should keep in mind. Do note that these are subject to change.
The camera-ready due date and the SemEval 2022 workshop dates will be announced at a later date.
There’s a Google group for all prospective participants: check it out at semeval2022-dictionaries-and-word-embeddings@googlegroups.com. We also have a Discord server: https://discord.gg/y8g6qXakNs. You can also reach the organizers directly at tmickus@atilf.fr; make sure to mention the SemEval task in the email subject.
The evaluation script is available on our git repository for reference. Note that the complete dataset is required to run all the metrics. Metrics requiring the full dataset are indicated as such in the list below. The complete dataset will be made available at the end of the competition.
Participants may not use any external resource. This requirement is to ensure that all submissions are easily comparable. We will ask participants planning to submit a system description paper to forward a link to their code.
Participants will also be invited to contribute their systems' outputs to a dataset of system productions. The purpose of this collection of system productions is to propose them as a shared task for upcoming text generation evaluation campaigns.
Definition modeling submissions are evaluated using three metrics: sense-level BLEU (Sense-BLEU), lemma-level BLEU (Lemma-BLEU), and MoverScore.
Scoring a definition modeling submission using MoverScore on CPU takes some time (15min or more). Results may not be available immediately upon submission.
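The scoring script in our repository remains the reference implementation. As a rough, unofficial illustration of the BLEU side of the evaluation, a candidate gloss can be compared to a reference gloss with NLTK; tokenization and smoothing here are assumptions and may differ from the official scorer.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example: one hypothesis gloss scored against a single reference gloss.
# The official scoring script may tokenize and smooth differently.
reference = "any of various bottom-dwelling fishes of cold marine waters".split()
hypothesis = "a bottom-dwelling fish of cold marine waters".split()

score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```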
Scores for distinct languages have different entries in the leaderboards, and will correspond to distinct official rankings in the task paper.
Submissions to the definition modeling track must consist of a ZIP archive containing one or more JSON files. These JSON files must contain a list of JSON objects, each of which must at least contain two keys: "id" and "gloss". The id key is used to match submissions with references. The gloss key should map to the string production to be evaluated. See our git repository for an example architecture that can output the correct JSON format.
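As an illustration, the snippet below writes a minimally valid definition modeling submission file. The ids and glosses are hypothetical placeholders; real ids must match those of the test entries, and the file name is only illustrative.

```python
import json

# Hypothetical predictions produced by your model, keyed by the dataset's "id" field.
predictions = [
    {"id": "en.test.1", "gloss": "a bottom-dwelling fish of cold marine waters"},
    {"id": "en.test.2", "gloss": "a small domesticated carnivorous mammal"},
]

# One JSON file per language; the setup is inferred from the file contents.
with open("en.defmod.json", "w", encoding="utf-8") as fh:
    json.dump(predictions, fh, ensure_ascii=False)
```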
To have your outputs scored, create a ZIP archive containing all the files you wish to submit, and upload it on CodaLab during the Evaluation phase. You can submit files for both tracks (definition modeling and reverse dictionary) at once in a single ZIP archive. Make sure that setups are unique: do not include two JSON files containing predictions for the same pair of track and language.
Do not attempt to submit glosses for different languages with a single JSON submission file. This will fail. Instead, make distinct submission files per language.
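Bundling the per-language prediction files into a single archive can be done with the standard library, for instance as sketched below (file names are hypothetical; only the JSON contents determine the track and language of each setup).

```python
from zipfile import ZipFile

# One JSON file per track/language setup, all zipped together for CodaLab.
files = ["en.defmod.json", "fr.defmod.json", "en.rd.json"]

with ZipFile("submission.zip", "w") as archive:
    for name in files:
        archive.write(name)
```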
We strongly encourage you to check the format of your submission using our format checker before submitting to CodaLab. This script will also summarize how your submission will be understood by the scoring program.
Reverse dictionary submissions are evaluated using three metrics: mean squared error (MSE), cosine similarity, and a cosine-based ranking measure.
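For intuition, MSE and cosine similarity between a reconstructed vector and its reference can be computed as below; the ranking measure additionally compares each reconstruction against the rest of the test set, so it is not reproduced in this sketch. The official scoring script remains the reference implementation.

```python
import numpy as np

def mse(pred: np.ndarray, gold: np.ndarray) -> float:
    """Mean squared error between a reconstructed and a reference embedding."""
    return float(np.mean((pred - gold) ** 2))

def cosine(pred: np.ndarray, gold: np.ndarray) -> float:
    """Cosine similarity between a reconstructed and a reference embedding."""
    return float(pred @ gold / (np.linalg.norm(pred) * np.linalg.norm(gold)))

# Toy vectors; real targets are the dataset's "sgns", "char" or "electra" embeddings.
pred = np.array([0.1, 0.3, -0.2])
gold = np.array([0.2, 0.2, -0.1])
print(mse(pred, gold), cosine(pred, gold))
```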
Scores for distinct embeddings and languages have different entries in the leaderboards, and will correspond to distinct official rankings in the task paper.
Submissions to the reverse dictionary track must consist of a ZIP archive containing one or more JSON files. These JSON files must contain a list of JSON objects, each of which must at least contain two keys: "id" and one among "sgns", "char" or "electra", identifying which architecture your submission tries to reconstruct. The "id" key is used to match submissions with references. The other key(s) should map to the vector reconstruction to be evaluated, as a list of float components. See our git repository for an example architecture that can output the correct JSON format.
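For example, a minimal reverse dictionary submission targeting the SGNS embeddings could be written as follows. The ids and vector values are toy placeholders: real ids must match the test entries, and real vectors must have the full dimensionality of the target embeddings.

```python
import json

# Hypothetical reconstructions; each vector is serialized as a plain list of floats.
predictions = [
    {"id": "en.test.1", "sgns": [0.12, -0.03, 0.57]},  # truncated toy vector
    {"id": "en.test.2", "sgns": [0.08, 0.41, -0.22]},
]

with open("en.rd.sgns.json", "w", encoding="utf-8") as fh:
    json.dump(predictions, fh)
```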
To have your outputs scored, create a ZIP archive containing all the files you wish to submit, and upload it on CodaLab during the Evaluation phase. You can submit files for both tracks (reverse dictionary and definition modeling) at once in a single ZIP archive. Make sure that setups are unique: do not include two JSON files containing predictions for the same configuration of track, language and embedding architecture.
Do not attempt to submit embeddings for different languages in a single JSON submission. This will fail. Instead, make distinct submission files per language. You may however group reconstructions for multiple architectures in a single submission file.
We strongly encourage you to check the format of your submission using our format checker before submitting to CodaLab. This script will also summarize how your submission will be understood by the scoring program.
We very strongly encourage participants to make use of the trial dataset for running manual evaluations of their systems' production. The presence of a manual evaluation in system descriptions will be taken into account during the reviewing process.
Participants should generally adopt a spirit of good sportsmanship and avoid any unfair or otherwise unconscionable conduct. We provide the following terms and conditions to clearly delineate the guidelines to which the participants are expected to adhere. Organizers reserve the right to amend in any way the following terms, in which case modifications will be advertised through the shared task mailing list and the CodaLab forums.
Participants may contact the organizers if any of the following terms raises their concern.
Participation in the competition: Any interested person may freely participate in the competition. By participating in the competition, you agree to the terms and conditions in their entirety, without amendment or provision. By participating in the competition, you understand and agree that your scores and submissions will be made public.
Scores and submissions are understood as any direct or indirect contributions to this site or the shared task organizers, such as, but not limited to: results of automatic scoring programs; manual, qualitative and quantitative assessments of the data submitted; etc.
Participants may create teams. Participants may not be part of more than one team. Teams and participants not belonging to any team must create exactly one account on the CodaLab competition. Team composition may not be changed once the evaluation phase starts.
Scoring of submissions: Organizers are under no obligation to release scores. Official scores may be withheld, amended or removed if organizers judge the submission incomplete, erroneous, deceptive, or violating the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
Up to 50 submissions will be allowed during the evaluation phase. Scores will not be visible on the leaderboards until the evaluation phase is over.
Submission files will be grouped according to the track, language, and in the case of the reverse dictionary track, the embedding architecture targeted; the last submission file per group will be understood as the team's or participant's definitive submission and ranked as such in the task description paper.
Data usage: The provided data should be used responsibly and ethically. Do not attempt to misuse it in any way, including, but not limited to, reconstructing test sets, any non-scientific use of the data, or any other unconscionable usage of the data.
During the course of the shared task, participants are not allowed to use any external data. This is to ensure that results are immediately comparable. Participants will be allowed to use external data once the evaluation phase is over for system review. All data will be released at the end of the evaluation phase.
Submission of system description papers: Participants having made at least one submission during the evaluation phase will be invited to submit a paper describing their system. As a requirement, a link to the code of systems being described will be made available to organizers or the public at large. Participants submitting a system description paper will also be asked to review papers submitted by their peers in a single-blind process.
We further encourage system description papers to include a manual analysis of their systems' results and productions. The presence and quality of such an analysis will be assessed during the review process. The task description paper will also devote a significant amount of space to highlighting outstanding manual evaluations conducted by participants.
Collection of system productions: Participants having made at least one submission during the evaluation phase will be invited to submit their systems' outputs to a dataset of system productions. The purpose of this collection of system productions will solely be to propose them as a shared task for upcoming text generation evaluation campaigns.
Funding Acknowledgments: This shared task was supported by a public grant overseen by the French National Research Agency (ANR) as part of the "Investissements d'Avenir" program: Idex Lorraine Université d'Excellence (reference: ANR-15-IDEX-0004).
Future sponsors, if any, will be appended to this section.
In this section, we list other relevant works on Definition Modeling and Reverse Dictionary applications.
Tom Bosc and Pascal Vincent. “Auto-Encoding Dictionary Definitions into Consistent Word Embeddings”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 1522–1532. (link).
Ting-Yun Chang and Yun-Nung Chen. “What Does This Word Mean? Explaining Contextualized Embeddings with Natural Language Definition”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 6064–6070. (link).
Timothee Mickus, Timothée Bernard, and Denis Paperno. “What Meaning-Form Correlation Has to Compose With: A Study of MFC on Artificial and Natural Language”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 3737–3749 (link).
Julien Tissier, Christophe Gravier, and Amaury Habrard. “Dict2vec : Learning Word Embeddings using Lexical Dictionaries”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 254–263 (link).
Michele Bevilacqua, Marco Maru, and Roberto Navigli. “Generationary or 'How We Went beyond Word Sense Inventories and Learned to Gloss'”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 7207–7221 (link).
Artyom Gadetsky, Ilya Yakubovskiy, and Dmitry Vetrov. “Conditional Generators of Words Definitions”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 266–271 (link).
Arman Kabiri and Paul Cook. “Evaluating a Multi-sense Definition Generation Model for Multiple Languages”. 2020 (link).
Thanapon Noraset et al. “Definition Modeling: Learning to Define Word Embeddings in Natural Language”. In: AAAI. 2017 (link).
Liner Yang et al. “Incorporating Sememes into Chinese Definition Modeling”. 2019 (link).
Haitong Zhang et al. “Improving Interpretability of Word Embeddings by Generating Definition and Usage” (link).
Slaven Bilac et al. “Dictionary search based on the target word description”. In: Proceedings of the 10th Annual Meeting of the Association for Natural Language Processing (ANLP 2004). 2004.
Hiram Calvo, Oscar Méndez, and Marco A. Moreno-Armendáriz. “Integrated Concept Blending with Vector Space Models”. In: Comput. Speech Lang. 40.C (Nov. 2016), pp. 79–96 (link).
Dominique Dutoit and Pierre Nugues. “A Lexical Database and an Algorithm to Find Words from Definitions”. In: Proceedings of the 15th European Conference on Artificial Intelligence. ECAI'02. Lyon, France: IOS Press, 2002, pp. 450–454.
Ilknur Durgar El-Kahlout and Kemal Oflazer. “Use of Wordnet for Retrieving Words from Their Meanings”. In: Proceedings of the Second Global Wordnet Conference (GWC 2004). 2004, pp. 118–123.
Felix Hill et al. “Learning to Understand Phrases by Embedding the Dictionary”. In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 17–30. (link).
Arman Malekzadeh, Amin Gheibi, and Ali Mohades. “PREDICT: Persian Reverse Dictionary”. Preprint. (link).
Oscar Méndez, Hiram Calvo, and Marco A. Moreno-Armendáriz. “A Reverse Dictionary Based on Semantic Analysis Using WordNet”. In: Advances in Artificial Intelligence and Its Applications. 2013, pp. 275–285.
Fanchao Qi et al. “WantWords: An Open-source Online Reverse Dictionary System”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020, pp. 175–181.
Ryan Shaw et al. “Building a scalable database-driven reverse dictionary”. In: IEEE Transactions on Knowledge and Data Engineering 25.3 (2013), pp. 528–540.
Bushra Siddique and Mirza Mohd Sufyan Beg. “A Review of Reverse Dictionary: Finding Words from Concept Description”. In: Next Generation Computing Technologies on Computational Intelligence. 2019, pp. 128–139
Sushrut Thorat and Varad Choudhari. “Implementing a Reverse Dictionary, based on word definitions, using a Node-Graph Architecture”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan: The COLING 2016 Organizing Committee, Dec. 2016, pp. 2797–2806. (link).
Hang Yan et al. “BERT for Monolingual and Cross-Lingual Reverse Dictionary”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 2020, pp. 4329–4338. (link).
Fabio Massimo Zanzotto et al. “Estimating Linear Models for Compositional Distributional Semantics”. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Beijing, China: Coling 2010 Organizing Committee, Aug. 2010, pp. 1263–1271 (link).
Lei Zhang et al. “Multi-channel reverse dictionary model”. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2020, pp. 312–319
This page lists baseline results on the development set for the two tracks, obtained with the architectures described in this sub-directory of the provided git repository.
Reverse Dictionary track baselines:

|            | MSE     | Cosine  | Ranking |
|------------|---------|---------|---------|
| en SGNS    | 0.91092 | 0.15132 | 0.49030 |
| en char    | 0.14776 | 0.79006 | 0.50218 |
| en electra | 1.41287 | 0.84283 | 0.49849 |
| es SGNS    | 0.92996 | 0.20406 | 0.49912 |
| es char    | 0.56952 | 0.80634 | 0.49778 |
| fr SGNS    | 1.14050 | 0.19774 | 0.49052 |
| fr char    | 0.39480 | 0.75852 | 0.49945 |
| fr electra | 1.15348 | 0.85629 | 0.49784 |
| it SGNS    | 1.12536 | 0.20430 | 0.47692 |
| it char    | 0.36309 | 0.72732 | 0.49663 |
| ru SGNS    | 0.57683 | 0.25316 | 0.49008 |
| ru char    | 0.13498 | 0.82624 | 0.49451 |
| ru electra | 0.87358 | 0.72086 | 0.49120 |
Definition Modeling track baselines:

|            | Sense-BLEU | Lemma-BLEU | MoverScore |
|------------|------------|------------|------------|
| en SGNS    | 0.00125    | 0.00250    | 0.10339    |
| en char    | 0.00011    | 0.00022    | 0.08852    |
| en electra | 0.00165    | 0.00215    | 0.08798    |
| es SGNS    | 0.01536    | 0.02667    | 0.20130    |
| es char    | 0.01505    | 0.02471    | 0.19933    |
| fr SGNS    | 0.00351    | 0.00604    | 0.18478    |
| fr char    | 0.00280    | 0.00706    | 0.18579    |
| fr electra | 0.00219    | 0.00301    | 0.17391    |
| it SGNS    | 0.02591    | 0.04081    | 0.20527    |
| it char    | 0.00640    | 0.00919    | 0.15902    |
| ru SGNS    | 0.01520    | 0.02112    | 0.34716    |
| ru char    | 0.01313    | 0.01847    | 0.32307    |
| ru electra | 0.01189    | 0.01457    | 0.33577    |