## > Possible bug in F1 calculation

We have found a bug in the F1 calculation algorithm provided here:

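As a result, an answer that exactly matches the ground truth can receive an F1 score below 1.0, which should not be possible. This happens when the answer contains repeated tokens: the overlap is counted over unique tokens, but precision and recall divide by the full token counts.
We would like to know whether the same bug is present in the evaluation algorithm.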
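A minimal reproduction of the problem, as a sketch. The `normalize_text` helper below is an assumed stand-in (lowercase, strip punctuation and articles), since the original helper is not shown in this thread, and `compute_f1_buggy` is just a hypothetical name for the function as published:

```python
import re
import string


def normalize_text(s):
    """Assumed stand-in for the original helper: lowercase, drop punctuation and articles."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def compute_f1_buggy(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    # Overlap is counted over unique tokens...
    common_tokens = set(pred_tokens) & set(truth_tokens)
    if len(common_tokens) == 0:
        return 0

    # ...but the denominators count every token, duplicates included.
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    return 2 * (prec * rec) / (prec + rec)


# Prediction and truth are identical, yet the score is not 1.0.
score = compute_f1_buggy("New York, New York", "New York, New York")
print(score)  # 0.5
```

Here the four tokens `new york new york` share only two unique tokens, so precision and recall are both 2/4.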

Details follow:

    def compute_f1(prediction, truth):
        pred_tokens = normalize_text(prediction).split()
        truth_tokens = normalize_text(truth).split()

        # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
        if len(pred_tokens) == 0 or len(truth_tokens) == 0:
            return int(pred_tokens == truth_tokens)

        common_tokens = set(pred_tokens) & set(truth_tokens)

        # if there are no common tokens then f1 = 0
        if len(common_tokens) == 0:
            return 0

        # The following two lines
        prec = len(common_tokens) / len(pred_tokens)
        rec = len(common_tokens) / len(truth_tokens)
        # should be replaced by
        prec = len(common_tokens) / len(set(pred_tokens))
        rec = len(common_tokens) / len(set(truth_tokens))

        return 2 * (prec * rec) / (prec + rec)

or, even cleaner:

    def compute_f1(prediction, truth):
        pred_tokens = set(normalize_text(prediction).split())
        truth_tokens = set(normalize_text(truth).split())

        # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
        if not pred_tokens or not truth_tokens:
            return int(pred_tokens == truth_tokens)

        common_tokens = pred_tokens & truth_tokens

        # if there are no common tokens then f1 = 0
        if not common_tokens:
            return 0

        prec = len(common_tokens) / len(pred_tokens)
        rec = len(common_tokens) / len(truth_tokens)

        return 2 * (prec * rec) / (prec + rec)

Posted by: t.dryjanski @ Jan. 12, 2022, 4:04 p.m.

A self-update: the F1 calculation we posted previously is still incorrect, because it does not account for repeated tokens.
Here is our updated proposal:
    def compute_f1(prediction, truth):
        pred_tokens = normalize_text(prediction).split()
        truth_tokens = normalize_text(truth).split()

        # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
        if len(pred_tokens) == 0 or len(truth_tokens) == 0:
            return int(pred_tokens == truth_tokens)

        # match each truth token against at most one not-yet-matched prediction token
        common_tokens = []
        extra_pred_tokens = pred_tokens.copy()
        for token in truth_tokens:
            if token in extra_pred_tokens:
                common_tokens.append(token)
                extra_pred_tokens.remove(token)

        # if there are no common tokens then f1 = 0
        if len(common_tokens) == 0:
            return 0

        prec = len(common_tokens) / len(pred_tokens)
        rec = len(common_tokens) / len(truth_tokens)

        return 2 * (prec * rec) / (prec + rec)
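The same multiset overlap can probably be expressed more compactly with `collections.Counter`, whose `&` operator keeps each token with the minimum of its two counts; this is the idiom the SQuAD evaluation script uses. The sketch below is one possible version, not a confirmed fix for the official script, and `normalize_text` is again an assumed stand-in for the helper that is not shown in this thread:

```python
import re
import string
from collections import Counter


def normalize_text(s):
    """Assumed stand-in for the original helper: lowercase, drop punctuation and articles."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # if either side is no-answer then f1 = 1 if they agree, 0 otherwise
    if not pred_tokens or not truth_tokens:
        return int(pred_tokens == truth_tokens)

    # multiset intersection: each token counted min(count in pred, count in truth) times
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0

    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)
    return 2 * (prec * rec) / (prec + rec)


exact = compute_f1("New York, New York", "New York, New York")
partial = compute_f1("New York", "New York, New York")
print(exact, partial)
```

With this counting, an exact match containing repeated tokens scores 1.0, while a prediction that omits a repeated token is penalized through recall, which a purely set-based version would miss.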

Posted by: t.dryjanski @ Jan. 18, 2022, 9:15 a.m.

Hello, are there any comments on this post? Are the metrics correct?
Dear Organizers, could you please check it?

Posted by: PawelBujnowski @ Jan. 20, 2022, 1:02 p.m.

Thanks for your posts. If I understand correctly, your main concern is that an incorrect prediction for an unanswerable question will contribute to the F1 score. I don't think that situation exists in our evaluation script, since we use null (or None in Python) to represent no-answer, so it is never normalized and split into tokens for further calculation. We will also make the evaluation script available shortly (you can download it using the link provided in the update log) so that participants can confirm it behaves the same as their offline experiments. I hope that answers your question!
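As a rough illustration of that handling (a hypothetical sketch only, not the actual evaluation script: the name `score_answer` and the plain lowercase whitespace tokenization are assumptions made for this example):

```python
from collections import Counter


def score_answer(prediction, truth):
    """Hypothetical wrapper: None marks no-answer and is never tokenized."""
    # No-answer case is decided before any normalization or splitting.
    if prediction is None or truth is None:
        return 1.0 if prediction is truth else 0.0

    # Assumed tokenization for illustration; the real script may normalize differently.
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)

    # Token-level F1 over the multiset overlap.
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_common == 0:
        return 0.0
    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)
    return 2 * prec * rec / (prec + rec)


print(score_answer(None, None))
print(score_answer("some answer", None))
```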

- Jingxuan

Posted by: r2vq @ Jan. 24, 2022, 3:43 p.m.