PuzzLing Machines

Organized by gozdesahin


Trial Phase
April 1, 2020, midnight UTC


Competition Phase
April 1, 2020, midnight UTC


Competition Ends
Dec. 31, 2025, 11:59 p.m. UTC


Current state-of-the-art models in many fields (e.g., computer vision, natural language processing, speech processing) rely on neural networks that require large amounts of training data to produce strong results. However, these models lack the ability to learn from "small data", which comes naturally to humans thanks to logical reasoning and common-sense knowledge. Humans, on the other hand, cannot process large amounts of data or perform fast computations. In this task, we want to encourage researchers to build systems that combine the best of both worlds: systems that achieve state-of-the-art results by exploiting big data but can also learn from small data. We draw inspiration from the Linguistics Olympiads, one of the 13 recognized International Science Olympiads targeted at high-school students. Solving these puzzles does not require any prior knowledge of or expertise in linguistics or a particular language, only some logical ability and common sense about natural languages, which we refer to as meta-linguistic knowledge.

More detailed and up-to-date information can be found on our main website.

Task Description

This shared task focuses on linguistic puzzles that take the form of translation questions. Each puzzle consists of a small number of phrases or sentences in English and their translations in a lesser-known language such as Wambaya. Based on these sample translation pairs, the participants need to translate new phrases or sentences into English or into the foreign language.

The translation pairs from the Chickasaw puzzle (Tom Payne, 2005, Linguistic Society of America) are given below:

Chickasaw                  English
Ofi’at kowi’ã lhiyohli.    The dog chases the cat.
Kowi’at ofi’ã lhiyohli.    The cat chases the dog.
Ofi’at shoha.              The dog stinks.
Ihooat hattakã hollo.      The woman loves the man.
Lhiyohlili.                I chase her/him.
Salhiyohli.                She/he chases me.
Hilha.                     She/he dances.

Given these 7 sentences as parallel data, the participants are then asked to translate the following English sentences into Chickasaw:

The man loves the woman.
The cat stinks.
I love her/him.

... and the following Chickasaw sentences into English:

Ihooat sahollo.
Ofi’at hilha.
Kowi’ã lhiyohlili.


This project is a collaborative effort initiated at UKP Lab at TU Darmstadt by Gözde Gül Şahin. Feel free to contact Gözde Gül Şahin at goezde {dot} guel {at} gmail {dot} com.


We'd like to thank Liane Vogel, Marc Simon Uecker and Siddharth Singh Parihar for their great help during the project. We are grateful to Dragomir Radev for his feedback and continuous help with encoding problems encountered during annotation.


Please cite us if you use the dataset:

@inproceedings{sahin20,
  author = {Gözde Gül Şahin and Yova Kementchedjhieva and Phillip Rust and Iryna Gurevych},
  title = {Puzz{L}ing {M}achines: {A} {L}earning {F}rom {S}mall {D}ata {C}hallenge},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, {ACL} 2020, July 5-10, 2020, Volume 1: Long Papers},
  year = {2020}
}


The evaluation is done separately for each direction: English->Foreign and Foreign->English. We report the scores for each direction as well as their average. For each answer, we calculate the following automatic measures: BLEU-2, CharacTER, ChrF-3 and exact match (EM). EM is 1 if the prediction and reference sentences match and 0 otherwise.

Puzzles are prepared so that they have only one answer. However, differences among languages can allow for several valid answers, e.g., a third-person pronoun in a language without gender marking can be translated as "he", "she" or "it" in English. Therefore, the participant's answer is evaluated against all alternative solutions and the highest score is assigned.

Example evaluation for "English to Foreign" translation

The solution below is taken from a NACLO puzzle where the foreign language is Basque, and the submission is from our SMT baseline.

Solution: "Nere familiak etxe berria erosi du"
Submission: "Nire familia du kotxe berria new house"


BLEU-2

Unlike standard MT evaluation, which uses BLEU-4, we use BLEU-2 because our dataset is dominated by short phrases and sentences. BLEU-2 is computed from the modified unigram and bigram precisions. In our case, there are 2 unigram matches, so p1 = 2/7, and zero bigram matches, which makes p2 = 0/6. A zero precision would drive the whole score to zero, so to avoid such harsh punishment we use the NIST geometric sequence smoothing, which substitutes null n-gram match counts with 1/(2^k), where k is 1 for the first value of n for which the n-gram match count is null. In that case, p2 is updated to 0.5/6. The brevity penalty (BP) is 1 since the submission is not shorter than the solution, and the weights are distributed equally as 0.5 and 0.5. The final score is then calculated as 0.154.
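The arithmetic above can be retraced with a minimal sketch. It assumes whitespace tokenization and a single reference, and the function name `bleu2` is hypothetical; it is not the official scoring script, only an illustration of the smoothing and weighting described in the text.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu2(submission, solution):
    """Smoothed BLEU-2 against one reference (illustrative sketch)."""
    hyp, ref = submission.split(), solution.split()
    log_sum = 0.0
    k = 0  # counts the n-gram orders that had zero matches so far
    for n in (1, 2):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0  # submission too short to contain any n-gram
        # Modified precision: clip each count by the reference count.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if matches == 0:
            k += 1
            matches = 1 / 2 ** k  # geometric sequence smoothing
        log_sum += 0.5 * math.log(matches / total)  # equal weights
    # Brevity penalty is 1 unless the submission is shorter.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_sum)


score = bleu2("Nire familia du kotxe berria new house",
              "Nere familiak etxe berria erosi du")
print(round(score, 3))  # 0.154
```

With p1 = 2/7 and the smoothed p2 = 0.5/6, the geometric mean is sqrt(1/42), which matches the 0.154 quoted above.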


CharacTER

This is a character-based edit distance measure proposed for languages with richer morphological processes. It calculates the editing cost as the shifting cost plus the Levenshtein distance between the aligned phrases. For the example above, the algorithm first finds the ideal match by shifting phrases of the submission, so that the edit distance between the submission and the solution is minimized:
Solution: "Nere familiak etxe berria erosi du"
Submission: "Nire familia du kotxe berria new house"
Shifted Submission: "Nire familia kotxe berria new du house"
As can be seen, "du" is moved to the position after "new". The shifting cost is calculated as the average length of the shifted phrases, which is 2.0 in our case. Next, the character-level Levenshtein distance between the shifted submission and the solution is calculated, and the total cost is normalized by dividing it by the length of the shifted submission. The calculation is then: CTER = (2.0+13.0)/38.0 = 0.39. Finally, we report the value 1.0-CTER=~0.60 for convenience.
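The normalization step can be checked with a few lines of code. This sketch takes the shift cost (2.0) and the edit distance (13.0) from the text as given and does not recompute them, since the alignment procedure is not spelled out here; the function name `cter_scores` is hypothetical.

```python
def cter_scores(shift_cost, edit_distance, shifted_submission):
    """Return (CTER, 1 - CTER), normalizing by the shifted submission length."""
    cter = (shift_cost + edit_distance) / len(shifted_submission)
    return cter, 1.0 - cter


shifted = "Nire familia kotxe berria new du house"
print(len(shifted))  # 38 characters, spaces included

# Shift cost and edit distance are the values quoted in the text.
cter, reported = cter_scores(2.0, 13.0, shifted)
print(cter, reported)
```

The length of 38 used as the denominator counts the spaces in the shifted submission as characters.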


ChrF-3

Next, we use the character trigram matching F1 score, ChrF-3, which is calculated as 0.68 for the example. (We experimented with various n-gram orders for words and characters, but found the other variants to be highly correlated with the scores we already report.)
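To make the idea of character n-gram matching concrete, here is a simplified sketch that computes an F1 score over character trigrams only. It is an assumption-laden illustration, not the official scorer: real chrF implementations may average over several n-gram orders, weight recall differently, or treat whitespace specially, so this sketch is not expected to reproduce the 0.68 quoted above. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a string (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_trigram_f1(submission, solution):
    """F1 over clipped character-trigram matches (simplified sketch)."""
    hyp, ref = char_ngrams(submission), char_ngrams(solution)
    overlap = sum((hyp & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = char_trigram_f1("Nire familia du kotxe berria new house",
                        "Nere familiak etxe berria erosi du")
print(score)
```

Because matching happens on characters, near misses such as "familia" versus "familiak" still earn partial credit, which word-level measures would not give.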

Exact Match (EM)

This is a binary measure, which is "1" if the submission matches any of the solutions and "0" otherwise, after preprocessing steps such as removing punctuation. It is therefore 0.0 for the example.

Example evaluation for "Foreign to English" translation

The same measures as above are used for this direction as well. Additional annotations mostly occur in this direction, so we demonstrate the preprocessing step here:

Solution: "You.SG [have] killed (her/him)."
Submission: "you danced with her."
First of all, we remove the pronoun tags, e.g., SG and PL, from both the solution and the submission. We believe these tags are most helpful while solving the puzzle, but not during evaluation. Next, we remove certain punctuation marks and lowercase both the solution and the submission. Then we create all alternative references as:
Ref1 "you have killed (her/him)"
Ref2 "you have killed her"
Ref3 "you have killed him"
Ref4 "you killed (her/him)"
Ref5 "you killed her"
Ref6 "you killed him"
We report the best scores measured over all alternative references. In this particular case, the scores are calculated as BLEU-2: 0.28, CharacTER: 0.52, ChrF-3: 0.44 and EM: 0.0.
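The preprocessing and reference expansion can be sketched as follows. The tag inventory (`SG`, `PL`), the punctuation set, and the function names `normalize`, `expand_references` and `exact_match` are assumptions made for illustration; the official script may handle more cases.

```python
import itertools
import re


def normalize(text):
    """Drop pronoun tags, lowercase, and strip sentence punctuation."""
    text = re.sub(r"\.(SG|PL)\b", "", text)  # assumed tag inventory
    text = text.lower()
    return re.sub(r"[.,!?;:]", "", text)  # brackets and slashes are kept


def expand_references(solution):
    """Expand [optional] words and (a/b) alternatives into references."""
    text = normalize(solution)
    parts = re.split(r"(\[[^\]]*\]|\([^)]*/[^)]*\))", text)
    choices = []
    for part in parts:
        if part.startswith("["):
            choices.append([part[1:-1], ""])  # with or without the word
        elif part.startswith("("):
            # keep the group as written, or pick one alternative
            choices.append([part] + part[1:-1].split("/"))
        else:
            choices.append([part])
    refs = []
    for combo in itertools.product(*choices):
        ref = " ".join("".join(combo).split())  # collapse extra spaces
        if ref not in refs:
            refs.append(ref)
    return refs


def exact_match(submission, solution):
    """Return 1 if the normalized submission equals any reference."""
    hyp = " ".join(normalize(submission).split())
    return int(hyp in expand_references(solution))


refs = expand_references("You.SG [have] killed (her/him).")
print(refs)
print(exact_match("you danced with her.", "You.SG [have] killed (her/him)."))
```

For the solution above this yields the six references Ref1 to Ref6 in the same order, and the exact match for the example submission is 0. The other measures would be computed against each reference in turn, keeping the maximum.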

Terms and Conditions

Our challenge provides a dataset derived from linguistic puzzles created by experts and is solely created for research purposes. The puzzles used in this shared task are compiled from various resources that may be copyrighted by the following organizations: © University of Oregon Department of Linguistics, © 2003-2019 International Linguistics Olympiad, © 2007-2018 North American Computational Linguistics Open Competition, © 2013-2017 UK Linguistics Olympiad, © 2008-2017 OZCLO The Australian Computational and Linguistics Olympiad, © 2009 Russian Linguistics Olympiad, © 2007-2009 Estonian Linguistic Olympiad, © 2012 All Ireland Linguistics Olympiad. Please insert citations or copyright notices to puzzles where appropriate. The dataset is distributed under the CC BY 1.0 license.

# Username Score
1 gozdesahin 100.0000
2 Deokjun_Eom 3.0200
3 Philippm 2.5600