We have found a bug in the F1 calculation algorithm provided here:

https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html

As a result, an exact-match answer can receive an F1 score below 1.0, which should not normally be possible.

We would like to know if the bug is also present in the evaluation algorithm.

Details follow:

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    common_tokens = set(pred_tokens) & set(truth_tokens)

    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0

    # The following two lines
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)

    # should be replaced by
    prec = len(common_tokens) / len(set(pred_tokens))
    rec = len(common_tokens) / len(set(truth_tokens))

    return 2 * (prec * rec) / (prec + rec)
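To make the bug concrete, here is a minimal reproduction of the original algorithm. The `normalize_text` below is a simplified stand-in for the blog post's normalization (lowercase, strip punctuation, drop articles); the exact normalizer does not affect the bug.

```python
import re
import string

def normalize_text(s):
    # Simplified stand-in for the blog post's normalization:
    # lowercase, drop punctuation and articles, collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def compute_f1_original(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    common_tokens = set(pred_tokens) & set(truth_tokens)
    if len(common_tokens) == 0:
        return 0
    # Bug: the numerator counts *distinct* common tokens, while the
    # denominators count *all* tokens, so any repeated token lowers F1.
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    return 2 * (prec * rec) / (prec + rec)

# An exact match containing a repeated token scores below 1.0:
answer = "step by step"
print(compute_f1_original(answer, answer))  # ≈ 0.667, not 1.0
```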

or, even cleaner:

def compute_f1(prediction, truth):
    pred_tokens = set(normalize_text(prediction).split())
    truth_tokens = set(normalize_text(truth).split())

    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if not pred_tokens or not truth_tokens:
        return int(pred_tokens == truth_tokens)

    common_tokens = pred_tokens & truth_tokens

    # if there are no common tokens then f1 = 0
    if not common_tokens:
        return 0

    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    return 2 * (prec * rec) / (prec + rec)
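Note that reducing both answers to sets discards token multiplicity entirely. A small sketch of the consequence (using an identity normalizer for brevity; the strings here are hypothetical):

```python
# Set-based F1 never penalizes a prediction that repeats a correct token:
pred = "new new york"   # hypothetical prediction with a spurious repeat
truth = "new york"

pred_set, truth_set = set(pred.split()), set(truth.split())
common = pred_set & truth_set            # {"new", "york"}
prec = len(common) / len(pred_set)       # 1.0
rec = len(common) / len(truth_set)       # 1.0
f1 = 2 * prec * rec / (prec + rec)
print(f1)  # 1.0, despite the extra "new" in the prediction
```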

A self-update: the method for F1 calculation we posted previously is still incorrect; it does not account for repeated tokens.

Here is our updated proposal:

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    # count common tokens as a multiset: each prediction token can
    # match at most one truth token
    common_tokens = []
    extra_pred_tokens = pred_tokens.copy()
    for token in truth_tokens:
        if token in extra_pred_tokens:
            common_tokens.append(token)
            extra_pred_tokens.remove(token)

    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0

    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    return 2 * (prec * rec) / (prec + rec)
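For what it's worth, the multiset overlap built by the loop above can be written more compactly with `collections.Counter`, whose `&` operator keeps the minimum count of each token; this is also how the official SQuAD 2.0 evaluation script counts overlap. A sketch, with `normalize_text` defaulting to `str.lower` just to keep the example self-contained:

```python
from collections import Counter

def compute_f1(prediction, truth, normalize_text=str.lower):
    # normalize_text defaults to str.lower here for a self-contained
    # example; substitute the competition's actual normalizer.
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    # Counter & Counter keeps min(count) per token: the multiset
    # intersection that the explicit loop builds.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0
    prec = num_same / len(pred_tokens)
    rec = num_same / len(truth_tokens)
    return 2 * (prec * rec) / (prec + rec)

print(compute_f1("step by step", "step by step"))  # 1.0
print(compute_f1("new new york", "new york"))      # ≈ 0.8: the repeat is penalized
```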

Hello, are there any comments on this post? Are the metrics correct?

Dear organizers, could you check this, please?

Thanks for your posts. If I understand correctly, your main concern is that an incorrect prediction for an unanswerable question will contribute to the F1 score. I don't think that situation exists in our evaluation script, since we use null (or None in Python) to represent no-answer; it is never normalized and split into tokens for further calculation. We will also make the evaluation script available shortly (you can download it using the link provided in the update log) so you can confirm it behaves the same as participants' offline experiments. I hope that answers your question!

- Jingxuan

Posted by: r2vq @ Jan. 24, 2022, 3:43 p.m.

Thank you for your answer.

Our concern was with the reference originally linked on the competition page, which allows an incorrect F1 calculation.

We have no issues with the newly linked code, so we are happy to close this thread.

Regards,

Tomasz