Mark Hopkins (Reed College), Ronan Le Bras, Cristian Petrescu-Prahova, Gabriel Stanovsky (Allen Institute for Artificial Intelligence), Hannaneh Hajishirzi, Rik Koncel-Kedziorski (University of Washington)
semeval-2019-task-10@googlegroups.com
Over the past four years, there has been a surge of interest in math question answering. In this SemEval task, we provide the opportunity for math QA systems to test themselves against a benchmark designed to evaluate high school students: the Math SAT (short for Scholastic Assessment Test).
The training and test data consists of unabridged practice exams from various study guides, covering the (now retired) exam format administered from 2005 to 2016. We have tagged questions into three broad categories: Closed Algebra, Open Algebra, and Geometry. A majority of the questions are 5-way multiple choice, and a minority have a numeric answer. Only the Geometry subset contains diagrams.
We are planning to provide 3000-4000 training questions, and a test set of over 1000 questions. Questions are stored as JSON, using LaTeX to encode mathematical formatting.
{
    "id": 846,
    "exam": "source4",
    "sectionNumber": 2,
    "sectionLength": 20,
    "originalQuestionNumber": 18,
    "question": "In the figure above, if the slope of line l is \\(-\\frac{3}{2}\\), what is the area of triangle AOB?",
    "answer": "E",
    "choices": {
        "E": "12",
        "A": "24",
        "B": "18",
        "C": "16",
        "D": "14"
    },
    "diagramRef": "diagram250.png",
    "tags": ["geometry"]
}
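For concreteness, here is one way such a file might be loaded. This is a minimal sketch only; the filename train.json and the assumption that questions are packaged as a single JSON array are ours, not part of the official release.

import json

# Minimal sketch (not the official loader): read questions stored as a
# JSON array in the format shown above. "train.json" is a hypothetical
# filename used for illustration only.
with open("train.json") as f:
    questions = json.load(f)

# Multiple-choice questions carry a "choices" object; the rest have a
# numeric answer.
multiple_choice = [q for q in questions if "choices" in q]
numeric = [q for q in questions if "choices" not in q]

# Only geometry questions reference a diagram.
geometry = [q for q in questions if "geometry" in q.get("tags", [])]

print(len(multiple_choice), len(numeric), len(geometry))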
Additionally, we will provide gold logical forms for a majority of the training questions in the Closed Algebra track. These logical forms are expressed in the same language used in the following paper:
Hopkins, M., Petrescu-Prahova, C., Levin, R., Le Bras, R., Herrasti, A., & Joshi, V. (2017). Beyond sentential semantic parsing: Tackling the Math SAT with a cascade of tree transducers. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 795-804).
We will also provide documentation and an interpreter for the logical form language. Competitors are free to ignore the provided logical forms, and may also use additional publicly available math training questions, such as AQuA or MAWPS; we ask only that competitors refrain from using additional Math SAT questions found on the web or elsewhere, to avoid potential train/test overlap.
Evaluation will be based solely on a system's ability to answer questions correctly.
For each subtask, the main evaluation metric will simply be question accuracy, i.e. the proportion of questions answered correctly. The evaluation script takes as input a list of JSON records of the form { "id": <id>, "answer": "<response>" }, where <id> is the integer id of a question and <response> is the guessed answer (either a choice key or a numeric string). It outputs the system's score as the number of correct responses divided by the total number of questions in the subtask.
While the main evaluation metric includes no penalty for guessing, we will also compute a secondary metric called penalized accuracy, which implements the actual scoring scheme used on these SATs: the number of correctly answered questions, minus 1/4 point for each incorrect guess (unanswered questions neither gain nor lose points). For example, a system that answers 60 questions correctly, answers 20 incorrectly, and leaves 20 unanswered scores 60 - 20/4 = 55 points. We include this metric to challenge participants to investigate high-precision QA systems.
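To make the two metrics concrete, the sketch below computes both from a gold answer file and a response file. It assumes each file holds a JSON list of {"id": ..., "answer": ...} records, and is offered as an illustration rather than the official evaluation script.

import json

def load_answers(path):
    # Map question id -> answer string from a JSON list of
    # {"id": ..., "answer": ...} records.
    with open(path) as f:
        return {d["id"]: d["answer"] for d in json.load(f)}

def score(gold_path, pred_path):
    gold = load_answers(gold_path)
    pred = load_answers(pred_path)

    correct = sum(1 for qid, ans in pred.items() if gold.get(qid) == ans)
    wrong = sum(1 for qid, ans in pred.items() if qid in gold and gold[qid] != ans)

    accuracy = correct / len(gold)        # primary metric: question accuracy
    penalized = correct - 0.25 * wrong    # secondary metric: omitted questions
                                          # earn nothing but incur no penalty
    return accuracy, penalized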
By submitting results to this competition, you consent to the public release of your scores at the SemEval-2019 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and any other metrics the task organizers see fit to report. You accept that the ultimate decision on metric choice and score values rests with the task organizers. You further agree that the task organizers are under no obligation to release scores, and that scores may be withheld if, in the task organizers' judgment, the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers. You agree to respect the following statements about the dataset:
Schedule:
Start: June 26, 2018. A handful of train/dev questions are provided as examples.
Start: Aug. 17, 2018. The full training and development sets are provided for system hillclimbing.
Start: Jan. 10, 2019. The competition's test set becomes available.
Start: Jan. 24, 2019. The main evaluation phase closes; interested competitors may continue to submit their best system results to the post-evaluation leaderboard.