Can you provide an evaluation script for calculating word-level F1 and exact match scores? I'm not clear on how these terms apply to the question answering problem. For example, given these two gold/prediction pairs:
gold: "by using a spoon", pred: "spoon"
gold: "using a waxed paper", pred: "using a waxed paper towel"
What would the precision, recall, and F1 be? Do I need to split the sentences word-wise, calculate F1 per sentence, and then average the F1 scores over all sentences?
Also, for my predictions, would the exact match score be 0, since neither answer exactly matches its gold answer?
Also, accuracy will be used as a metric for yes/no questions, but I am not able to find any yes/no questions in the train set. Will they be present in the evaluation/test set?
Posted by: kartikaggarwal98 @ Nov. 26, 2021, 8 p.m.

Hi,
Thanks for your questions. We will provide the evaluation script soon along with the validation set.
You can refer to the SQuAD paper (https://arxiv.org/pdf/1606.05250.pdf) for details about the (macro-averaged) F1 score; we use the same evaluation metrics. There is also a blog post discussing the details of these metrics (https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html).
For the example in your post:
- Precision, recall, and F1 are calculated over tokens, so each sentence first needs to be split into tokens. The final F1 score is the average of the per-question F1 scores (see the sketch below).
- Yes, for each question the exact match score will be zero if the prediction doesn't exactly match the gold answer.
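This isn't the official script, but a minimal Python sketch along these lines computes both metrics for the example in your post. Note that it is simplified: it tokenizes on whitespace only, whereas the official SQuAD evaluation script additionally lowercases and strips punctuation and articles before comparing, so exact numbers can differ slightly.

```python
from collections import Counter

def f1_score(gold: str, pred: str) -> float:
    # Token-level F1 between one gold answer and one prediction.
    # Simplified sketch: whitespace tokenization only; the official SQuAD
    # script also lowercases and strips punctuation/articles first.
    gold_tokens = gold.split()
    pred_tokens = pred.split()
    common = Counter(gold_tokens) & Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(gold: str, pred: str) -> float:
    # 1.0 only if the strings match exactly, else 0.0.
    return float(gold == pred)

# The two pairs from the question above:
golds = ["by using a spoon", "using a waxed paper"]
preds = ["spoon", "using a waxed paper towel"]

f1s = [f1_score(g, p) for g, p in zip(golds, preds)]
ems = [exact_match(g, p) for g, p in zip(golds, preds)]
# Pair 1: precision 1/1, recall 1/4 -> F1 = 0.400
# Pair 2: precision 4/5, recall 4/4 -> F1 ~= 0.889
print(f"Macro-averaged F1: {sum(f1s) / len(f1s):.3f}")  # 0.644
print(f"Exact match:       {sum(ems) / len(ems):.3f}")  # 0.000
```

So for your two examples the macro-averaged F1 would be about 0.644, and the exact match score would be 0 for both.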
There will be no yes/no questions in either the train or the val/test sets. We decided not to include those in the final release and to use EM and F1 only. We will also update the CodaLab page accordingly, along with some changes to the output format.
Sorry about the confusion and let us know if you have further questions!
- Jingxuan
Posted by: r2vq @ Nov. 29, 2021, 6:31 p.m.