CODWOE - Comparing Dictionaries and Word Embeddings

Organized by tmickus

Competition phases: Evaluation (started Jan. 10, 2022, midnight UTC), followed by Post-Evaluation (started Feb. 1, 2022, noon UTC; current phase). The competition never ends.

CODWOE: COmparing Dictionaries and WOrd Embeddings

The CODWOE shared task invites you to compare two types of semantic descriptions: dictionary glosses and word embedding representations. Are these two types of representation equivalent? Can we generate one from the other? To study this question, we propose two tracks: a definition modeling track (Noraset et al., 2017), where participants have to generate glosses from vectors, and a reverse dictionary track (Hill et al., 2016, a.o.), where participants have to generate vectors from glosses.

Dictionaries contain definitions, such as Merriam Webster's:

cod: any of various bottom-dwelling fishes (family Gadidae, the cod family) that usually occur in cold marine waters and often have barbels and three dorsal fins

The task of definition modeling consists in using the vector representation of co⃗d to produce the associated gloss, "any of various bottom-dwelling fishes (family Gadidae, the cod family) that usually occur in cold marine waters and often have barbels and three dorsal fins". The reverse dictionary task is the mathematical inverse: reconstruct an embedding co⃗d from the corresponding gloss.

These two tracks display a number of interesting characteristics. They are obviously useful for explainable AI, since they involve converting human-readable data into machine-readable data and back. They also have theoretical significance: glosses and word embeddings are both representations of meaning, so these tasks amount to converting between two distinct, non-formal semantic representations. From a practical point of view, the ability to infer word embeddings from dictionary resources, or dictionaries from large unannotated corpora, would prove a boon for many under-resourced languages.

How whale did it go?

Here is an overview of the official results of the competition! More information can be found on the associated GitHub repository.

Definition Modeling track

| user / team   | Rank EN | Rank ES | Rank FR | Rank IT | Rank RU |
|---------------|---------|---------|---------|---------|---------|
| Locchi        | 8       | 6       |         | 7       |         |
| WENGSYX       | 9       | 7       | 6       | 6       | 6       |
| cunliang.kong | 3       | 2       | 3       | 1       | 2       |
| IRB-NLP       | 2       | 1       | 1       | 5       | 5       |
| emukans       | 5       | 4       | 4       | 4       | 3       |
| guntis        | 6       |         |         |         |         |
| lukechan1231  | 7       | 5       | 5       | 3       | 4       |
| pzchen        | 4       | 3       | 2       | 2       | 1       |
| talent404     | 1       |         |         |         |         |

Reverse Dictionary track, SGNS

| user / team      | Rank EN | Rank ES | Rank FR | Rank IT | Rank RU |
|------------------|---------|---------|---------|---------|---------|
| Locchi           | 4       |         |         | 4       |         |
| Nihed_Bendahman_ | 5       | 5       | 4       | 6       | 4       |
| WENGSYX          | 1       | 2       | 2       | 3       | 1       |
| MMG              |         | 3       |         |         |         |
| chlrbgus321      | N/A     |         |         |         |         |
| IRB-NLP          | 3       | 1       | 1       | 1       | 2       |
| pzchen           | 2       | 4       | 3       | 2       | 3       |
| the0ne           | 7       |         |         |         |         |
| tthhanh          | 8       | 7       | 6       | 7       | 6       |
| zhwa3087         | 6       | 6       | 5       | 5       | 5       |

Reverse Dictionary track, electra

| user / team      | Rank EN | Rank FR | Rank RU |
|------------------|---------|---------|---------|
| Locchi           | 3       |         |         |
| Nihed_Bendahman_ | 2       | 2       | 4       |
| WENGSYX          | 4       | 4       | 2       |
| IRB-NLP          | 5       | 3       | 3       |
| pzchen           | 1       | 1       | 1       |
| the0ne           | 6       |         |         |

Reverse Dictionary track, char

| user / team      | Rank EN | Rank ES | Rank FR | Rank IT | Rank RU |
|------------------|---------|---------|---------|---------|---------|
| Locchi           | 1       |         |         | 4       |         |
| Nihed_Bendahman_ | 2       | 2       | 2       | 3       | 4       |
| WENGSYX          | 7       | 5       | 5       | 6       | 5       |
| IRB-NLP          | 4       | 3       | 4       | 2       | 2       |
| pzchen           | 3       | 1       | 1       | 1       | 1       |
| the0ne           | 5       |         |         |         |         |
| zhwa3087         | 6       | 4       | 3       | 5       | 3       |

Dive right in and get started!

The data can be retrieved from our dedicated web page. See the related CodaLab page for further details as well.

To help participants get started, we provide a basic architecture for both tracks, a submission format checker, and the scoring script. All of this is available in our public git repository.

Keep in mind that we do not allow external data! The point is to keep results linguistically significant and easily comparable. For all details on how we will evaluate submissions, check the relevant CodaLab page.

What we are fishing for with this shared task

Rather than focusing strictly on getting the highest scores on a benchmark, we encourage participants to approach this shared task as a collaborative research question: how should we compare two vastly different types of semantic representations such as dictionaries and word embeddings? What caveats are there? In fact, we already have a few questions we look forward to studying at the end of this shared task:

  • Do all architectures yield comparable results? Transformers, for instance, are generally hard to tune, require large amounts of data to train and have no default way of being primed with a vector: how will they fare on our two tracks?
  • What are the effects of combining different inputs? Do multilingual models fare better than monolingual models? Does handling both tracks with the same model help or hinder results?
  • Do contextual embeddings help to define polysemous words? Most approaches that use contextual embeddings in downstream applications rely on fine-tuning. Will contextual embeddings used as features also prove helpful?

These are but a few questions that we are interested in—do come up with your own to test during this shared task! To encourage participants to adopt this mindset, here are a few key elements of this shared task:

  • data from 5 languages (EN, ES, FR, IT, RU) and from multiple embedding architectures, both static and contextual, all trained on comparable corpora
  • a richly annotated trial dataset, which will be useful for the manual evaluation of your systems
  • usage of external resources is not allowed, to ensure that all submissions are comparable
  • a strong focus on manual analyses of a submitted model’s behavior during the reviewing process

As is usual for SemEval tasks, we will release all data at the end of the shared task. Depending on participants’ consent, we also plan to collect the productions of all models and reuse them in a future evaluation campaign.

Shared task timeline (this too shall bass)

Here are the key dates participants should keep in mind. Do note that these are subject to change.

  • September 3, 2021: Training data & development data made available
  • January 10, 2022: Evaluation data made available & evaluation start
  • January 31, 2022: Evaluation end
  • February 23, 2022: Paper submission due
  • March 31, 2022: Notification to authors

Camera-ready due date and SemEval 2022 workshops will be announced at a later date.

You have an issue? You need kelp? Get in touch!

There’s a Google group for all prospective participants: check it out at semeval2022-dictionaries-and-word-embeddings@googlegroups.com. We also have a Discord server: https://discord.gg/y8g6qXakNs. You can also reach the organizers directly at tmickus@atilf.fr; make sure to mention the SemEval task in the email subject.

Evaluation Criteria

The evaluation script is available on our git repository for reference. Note that the complete dataset is required to run all the metrics. Metrics requiring the full dataset are indicated as such in the list below. The complete dataset will be made available at the end of the competition.

Participants may not use any external resource. This requirement is to ensure that all submissions are easily comparable. We will ask participants planning to submit a system description paper to forward a link to their code.

Participants will also be invited to contribute their systems' outputs to a dataset of system productions. The purpose of this collection of system productions is to propose them as a shared task for upcoming text generation evaluation campaigns.

Metrics for the definition modeling track

Definition modeling submissions are evaluated using three metrics:

  • a MoverScore, appearing as MvSc. on the leaderboards; it is computed using the original implementation of Zhao et al. (2019).
  • a BLEU score, appearing as S-BLEU on the leaderboards. The S here stands for "sense-level", as it is computed using the target gloss as the sole reference for the production. We use the NLTK implementation.
  • a lemma-level BLEU score, appearing as L-BLEU on the leaderboards. Concretely, we compute the BLEU score between the production and every gloss with the same word and part of speech, and then select the maximum score among these. We introduce this score because some definition modeling examples share the same input (character-based embedding or word2vec representation) and yet have different targets. The complete dataset, which will be made available at the end of the competition, is required to group entries per lemma. Again, we use the NLTK implementation; a short sketch of both BLEU variants follows this list.
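
To make the two BLEU variants concrete, here is a minimal sketch of how they could be computed with the NLTK implementation. The whitespace tokenization, the smoothing choice, and the helper names are illustrative assumptions; the authoritative implementation is the scoring script in our git repository.

```python
# Illustrative sketch of the two BLEU variants (not the official scorer).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method4  # assumption: some smoothing for short glosses

def sense_bleu(production, target_gloss):
    """Sense-level BLEU: the target gloss is the sole reference."""
    return sentence_bleu([target_gloss.split()], production.split(),
                         smoothing_function=smooth)

def lemma_bleu(production, glosses_same_lemma_and_pos):
    """Lemma-level BLEU: maximum BLEU over all glosses sharing word and POS."""
    return max(sense_bleu(production, gloss) for gloss in glosses_same_lemma_and_pos)

# Example with made-up glosses:
hypothesis = "any of various cold water fishes with three dorsal fins"
references = ["any of various bottom-dwelling fishes that usually occur in cold marine waters",
              "a fish of the family Gadidae"]
print(sense_bleu(hypothesis, references[0]), lemma_bleu(hypothesis, references))
```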

Scoring a definition modeling submission using MoverScore on CPU takes some time (15min or more). Results may not be available immediately upon submission.

Scores for distinct languages have different entries in the leaderboards, and will correspond to distinct official rankings in the task paper.

Submissions to the definition modeling track must consist of a ZIP archive containing one or more JSON files. These JSON files must contain a list of JSON objects, each of which must at least contain two keys: "id" and "gloss". The id key is used to match submissions with references. The gloss key should map to the string production to be evaluated. See our git repository for an example architecture that can output the correct JSON format.
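
As a minimal illustration (not the official baseline code), the snippet below assembles and archives such a file in Python; the id values, glosses, and file names are placeholders.

```python
# Sketch of a definition modeling submission: a JSON list of objects, each with
# at least an "id" and a "gloss". All id values and file names are placeholders.
import json
import zipfile

predictions = [
    {"id": "en.test.defmod.1", "gloss": "any of various bottom-dwelling fishes"},
    {"id": "en.test.defmod.2", "gloss": "a domesticated carnivorous mammal"},
]

with open("defmod_en.json", "w", encoding="utf-8") as ostr:
    json.dump(predictions, ostr, ensure_ascii=False)

# One or more such files go into the ZIP archive uploaded to CodaLab.
with zipfile.ZipFile("submission.zip", "w") as archive:
    archive.write("defmod_en.json")
```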

To have your outputs scored, create a ZIP archive containing all the files you wish to submit, and upload it on CodaLab during the Evaluation phase. You can submit files for both tracks (definition modeling and reverse dictionary) at once in a single ZIP archive. Make sure that setups are unique: do not include two JSON files containing predictions for the same pair of track and language.

Do not attempt to submit glosses for different languages with a single JSON submission file. This will fail. Instead, make distinct submission files per language.

We strongly encourage you to check the format of your submission using our format checker before submitting to CodaLab. This script will also summarize how your submission will be understood by the scoring program.

Metrics for the reverse dictionary track

Reverse dictionary submissions are evaluated using three metrics:

  • mean squared error between the submission's reconstructed embedding and the reference embedding
  • cosine similarity between the submission's reconstructed embedding and the reference embedding
  • cosine-based ranking between the submission's reconstructed embedding and the reference embedding; i.e., how many other test items have a higher cosine with the reconstructed embedding than its own reference embedding does (a computation sketch follows this list).
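
The sketch below shows one way these three metrics could be computed with NumPy. The averaging over test items and the strict-inequality reading of the ranking metric are assumptions; the scoring script in our git repository is the reference implementation.

```python
# Illustrative computation of the reverse dictionary metrics (not the official
# scorer). Both arrays have shape (n_items, dim); averaging over items is assumed.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def reverse_dictionary_metrics(reconstructions, references):
    mse = float(np.mean((reconstructions - references) ** 2))
    cosines = [cosine(p, r) for p, r in zip(reconstructions, references)]
    ranks = []
    for pred, ref in zip(reconstructions, references):
        cos_to_own_ref = cosine(pred, ref)
        # Count test items whose reference embedding is strictly closer (in
        # cosine) to the reconstruction than the item's own reference is.
        ranks.append(sum(cosine(pred, other) > cos_to_own_ref for other in references))
    return mse, float(np.mean(cosines)), float(np.mean(ranks))
```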

Scores for distinct embeddings and languages have different entries in the leaderboards, and will correspond to distinct official rankings in the task paper.

Submissions to the reverse dictionary track must consist of a ZIP archive containing one or more JSON files. These JSON files must contain a list of JSON objects, each of which must at least contain two keys: "id" and one among "sgns", "char" or "electra", identifying which architecture your submission tries to reconstruct. The "id" key is used to match submissions with references. The other key(s) should map to the vector reconstruction to be evaluated, as a list of float components. See our git repository for an example architecture that can output the correct JSON format.
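
As with the definition modeling track, here is a minimal illustration of the expected file contents (not the official baseline code); the dimensionality, id values, and zero vectors are placeholders.

```python
# Sketch of a reverse dictionary submission: each object pairs an "id" with one
# or more reconstructed vectors keyed by the targeted architecture ("sgns",
# "char" or "electra"). All values below are placeholders.
import json

dim = 256  # placeholder; use the dimensionality of the provided embeddings

predictions = [
    {"id": "en.test.revdict.1", "sgns": [0.0] * dim, "char": [0.0] * dim},
    {"id": "en.test.revdict.2", "sgns": [0.0] * dim, "char": [0.0] * dim},
]

with open("revdict_en.json", "w", encoding="utf-8") as ostr:
    json.dump(predictions, ostr)
```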

To have your outputs scored, create a ZIP archive containing all the files you wish to submit, and upload it on CodaLab during the Evaluation phase. You can submit files for both tracks (reverse dictionary and definition modeling) at once in a single ZIP archive. Make sure that setups are unique: do not include two JSON files containing predictions for the same configuration of track, language and embedding architecture.

Do not attempt to submit embeddings for different languages in a single JSON submission. This will fail. Instead, make distinct submission files per language. You may however group reconstructions for multiple architectures in a single submission file.

We strongly encourage you to check the format of your submission using our format checker before submitting to CodaLab. This script will also summarize how your submission will be understood by the scoring program.

Manual evaluations

We very strongly encourage participants to make use of the trial dataset for running manual evaluations of their systems' production. The presence of a manual evaluation in system descriptions will be taken into account during the reviewing process.

Terms and Conditions

Participants should generally adopt a spirit of good sportsmanship and avoid any unfair or otherwise unconscionable conduct. We provide the following terms and conditions to clearly delineate the guidelines to which the participants are expected to adhere. Organizers reserve the right to amend in any way the following terms, in which case modifications will be advertised through the shared task mailing list and the CodaLab forums.
Participants may contact the organizers if any of the following terms raises their concern.

Participation in the competition: Any interested person may freely participate in the competition. By participating in the competition, you agree to the terms and conditions in their entirety, without amendment or provision. By participating in the competition, you understand and agree that your scores and submissions will be made public.
Scores and submissions are understood as any direct or indirect contributions to this site or the shared task organizers, such as, but not limited to: results of automatic scoring programs; manual, qualitative and quantitative assessments of the data submitted; etc.
Participants may create teams. Participants may not be part of more than one team. Teams and participants not belonging to any team must create exactly one account on the CodaLab competition. Team composition may not be changed once the evaluation phase starts.

Scoring of submissions: Organizers are under no obligation to release scores. Official scores may be withheld, amended or removed if organizers judge the submission incomplete, erroneous, deceptive, or violating the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
Up to 50 submissions will be allowed during the evaluation phase. Scores will not be visible on the leaderboards until the evaluation phase is over.
Submission files will be grouped according to the track, language, and in the case of the reverse dictionary track, the embedding architecture targeted; the last submission file per group will be understood as the team's or participant's definitive submission and ranked as such in the task description paper.

Data usage: The provided data should be used responsibly and ethically. Do not attempt to misuse it in any way, including, but not limited to, reconstructing test sets, any non-scientific use of the data, or any other unconscionable usage of the data.
During the course of the shared task, participants are not allowed to use any external data. This is to ensure that results are immediately comparable. Participants will be allowed to use external data once the evaluation phase is over for system review. All data will be released at the end of the evaluation phase.

Submission of system description papers: Participants having made at least one submission during the evaluation phase will be invited to submit a paper describing their system. As a requirement, a link to the code of the systems being described must be made available to the organizers or to the public at large. Participants submitting a system description paper will also be asked to review papers submitted by their peers in a single-blind process.
We further encourage system description papers to include a manual analysis of their systems' results and productions. The presence and quality of such an analysis will be assessed during the review process. The task description paper will also devote a significant amount of space to highlighting outstanding manual evaluations conducted by participants.

Collection of system productions: Participants having made at least one submission during the evaluation phase will be invited to submit their systems' outputs to a dataset of system productions. The purpose of this collection of system productions will solely be to propose them as a shared task for upcoming text generation evaluation campaigns.

Funding Acknowledgments: This shared task was supported by a public grant overseen by the French National Research Agency (ANR) as part of the "Investissements d'Avenir" program: Idex Lorraine Université d'Excellence (reference: ANR-15-IDEX-0004).
Future sponsors, if any, will be appended to this section.

Relevant works

In this section, we list other relevant works on Definition Modeling and Reverse Dictionary applications.

Embeddings & Dictionaries

Tom Bosc and Pascal Vincent. “Auto-Encoding Dictionary Definitions into Consistent Word Embeddings”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 1522–1532. (link).

Ting-Yun Chang and Yun-Nung Chen. “What Does This Word Mean? Explaining Contextualized Embeddings with Natural Language Definition”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 6064–6070. (link).

Timothee Mickus, Timothée Bernard, and Denis Paperno. “What Meaning-Form Correlation Has to Compose With: A Study of MFC on Artificial and Natural Language”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 3737–3749 (link).

Julien Tissier, Christophe Gravier, and Amaury Habrard. “Dict2vec : Learning Word Embeddings using Lexical Dictionaries”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 254–263 (link).

Definition Modeling

Michele Bevilacqua, Marco Maru, and Roberto Navigli. “Generationary or 'How We Went beyond Word Sense Inventories and Learned to Gloss'”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 7207–7221 (link).

Artyom Gadetsky, Ilya Yakubovskiy, and Dmitry Vetrov. “Conditional Generators of Words Definitions”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 266–271. (link).

Arman Kabiri and P. Cook. “Evaluating a Multi-sense Definition Generation Model for Multiple Languages”. 2020 (link).

Thanapon Noraset et al. “Definition Modeling: Learning to define word embeddings in natural language”. In: AAAI. 2017 (link).

Liner Yang et al. Incorporating Sememes into Chinese Definition Modeling. 2019 (link).

Haitong Zhang et al. Improving Interpretability of Word Embeddings by Generating Definition and Usage (link).

Reverse Dictionary

Slaven Bilac et al. “Dictionary search based on the target word description”. In: Proceedings of the 10th Annual Meeting of the Association for Natural Language Processing (ANLP 2004). 2004.

Hiram Calvo, Oscar Méndez, and Marco A. Moreno-Armendáriz. “Integrated Concept Blending with Vector Space Models”. In: Comput. Speech Lang.40.C (Nov. 2016), pp. 79–96. (link).

Dominique Dutoit and Pierre Nugues. “A Lexical Database and an Algorithm to Find Words from Definitions”. In: Proceedings of the 15th European Conference on Artificial Intelligence. ECAI'02. Lyon, France: IOS Press, 2002, pp. 450–454.

Ilknur Durgar El Khalout and Kemal Oflazer. “Use of Wordnet for Retrieving Words from Their Meanings”. In: Proceedings of the Second Global Wordnet Conference (GWC 2004). 2004, pp. 118–123.

Felix Hill et al. “Learning to Understand Phrases by Embedding the Dictionary”. In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 17–30. (link).

Arman Malekzadeh, Amin Gheibi, and Ali Mohades. “PREDICT: Persian Reverse Dictionary”. Preprint. (link).

Oscar Méndez, Hiram Calvo, and Marco A. Moreno-Armendáriz. “A Reverse Dictionary Based on Semantic Analysis Using WordNet”. In: Advances in Artificial Intelligence and Its Applications. 2013, pp. 275–285.

Fanchao Qi et al. “WantWords: An Open-source Online Reverse Dictionary System”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020, pp. 175–181.

Ryan Shaw et al. “Building a scalable database-driven reverse dictionary”. In: IEEE Transactions on Knowledge and Data Engineering 25.3 (2013), pp. 528–540.

Bushra Siddique and Mirza Mohd Sufyan Beg. “A Review of Reverse Dictionary: Finding Words from Concept Description”. In: Next Generation Computing Technologies on Computational Intelligence. 2019, pp. 128–139

Sushrut Thorat and Varad Choudhari. “Implementing a Reverse Dictionary, based on word definitions, using a Node-Graph Architecture”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan: The COLING 2016 Organizing Committee, Dec. 2016, pp. 2797–2806. (link).

Hang Yan et al. “BERT for Monolingual and Cross-Lingual Reverse Dictionary”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 2020, pp. 4329–4338. (link).

Fabio Massimo Zanzotto et al. “Estimating Linear Models for Compositional Distributional Semantics”. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Beijing, China: Coling 2010 Organizing Committee, Aug. 2010, pp. 1263–1271 (link).

Lei Zhang et al. “Multi-channel reverse dictionary model”. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2020, pp. 312–319

Baseline results

This page lists baseline results on the development set for the two tracks, obtained with the architectures described in this sub-directory of the provided git repository.

Reverse Dictionary track

| Setup      | MSE     | Cosine  | Ranking |
|------------|---------|---------|---------|
| en SGNS    | 0.91092 | 0.15132 | 0.49030 |
| en char    | 0.14776 | 0.79006 | 0.50218 |
| en electra | 1.41287 | 0.84283 | 0.49849 |
| es SGNS    | 0.92996 | 0.20406 | 0.49912 |
| es char    | 0.56952 | 0.80634 | 0.49778 |
| fr SGNS    | 1.14050 | 0.19774 | 0.49052 |
| fr char    | 0.39480 | 0.75852 | 0.49945 |
| fr electra | 1.15348 | 0.85629 | 0.49784 |
| it SGNS    | 1.12536 | 0.20430 | 0.47692 |
| it char    | 0.36309 | 0.72732 | 0.49663 |
| ru SGNS    | 0.57683 | 0.25316 | 0.49008 |
| ru char    | 0.13498 | 0.82624 | 0.49451 |
| ru electra | 0.87358 | 0.72086 | 0.49120 |

Definition Modeling track

| Setup      | Sense-BLEU | Lemma-BLEU | MoverScore |
|------------|------------|------------|------------|
| en SGNS    | 0.00125    | 0.00250    | 0.10339    |
| en char    | 0.00011    | 0.00022    | 0.08852    |
| en electra | 0.00165    | 0.00215    | 0.08798    |
| es SGNS    | 0.01536    | 0.02667    | 0.20130    |
| es char    | 0.01505    | 0.02471    | 0.19933    |
| fr SGNS    | 0.00351    | 0.00604    | 0.18478    |
| fr char    | 0.00280    | 0.00706    | 0.18579    |
| fr electra | 0.00219    | 0.00301    | 0.17391    |
| it SGNS    | 0.02591    | 0.04081    | 0.20527    |
| it char    | 0.00640    | 0.00919    | 0.15902    |
| ru SGNS    | 0.01520    | 0.02112    | 0.34716    |
| ru char    | 0.01313    | 0.01847    | 0.32307    |
| ru electra | 0.01189    | 0.01457    | 0.33577    |
