## Evaluation Metric

In the code and evaluation section of the BEETL site (https://beetl.ai/code) you mention accuracy as the evaluation metric. On GitHub (https://github.com/XiaoxiWei/NeurIPS_BEETL/blob/main/start_kits/4.LeaderboardLabelGeneration.ipynb), however, you state: "The score is computed according to classification accuracies with weights of inverse frequency of a label." Could you clarify which one is correct and, if the weighted accuracy from GitHub is the one used, specify its formula?

Posted by: StylianosBakas @ Aug. 25, 2021, 2:52 p.m.

Thanks for the question. For each of the two tasks, I can confirm that the score is computed according to 'classification accuracies' with weights of inverse frequency of a label. That is a type of 'accuracy'; we made this a bit clearer in the tutorial than on the webpage. The final score across the two tasks (the balance between them) is computed with the equation in the 'evaluation' section at https://beetl.ai/code. Hope that helps.

Best,

Xiaoxi Wei

Posted by: BEETLCompetition @ Aug. 25, 2021, 3:40 p.m.

To clarify further, is the following formula correct? Let a1, a2, ..., ak be the class-wise accuracies and n1, n2, ..., nk the number of samples per class:

Task Score = ((1/n1) * a1 + (1/n2) * a2 + ... + (1/nk) * ak) / (1/n1 + 1/n2 + ... + 1/nk)

Posted by: StylianosBakas @ Aug. 26, 2021, 9:02 a.m.

The way to compute it is as follows. Say you have X classes; each class can then contribute at most 100/X to the total score. If class 1 has N1 labels to be predicted (the number of EEG trials of class 1), then the score you get for class 1 is ((100/X)/N1)*ACC1, where ACC1 is the number of your correct predictions for class 1. Do the same for the other X-1 classes, then sum the results.
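
A minimal sketch of this computation in Python, not the official scoring code. The function name `beetl_task_score` is hypothetical, and the sketch assumes that the set of classes is exactly the set of labels present in `y_true`:

```python
from collections import Counter


def beetl_task_score(y_true, y_pred):
    """Per-task score in [0, 100] following the description above.

    Hypothetical helper: each of the X classes can contribute at most 100/X,
    and class c earns ((100/X)/N_c) * ACC_c, where N_c is its number of trials
    and ACC_c its number of correct predictions.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # N_c: number of trials per true class (assumes every class occurs in y_true)
    n_per_class = Counter(y_true)
    # ACC_c: number of correct predictions per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    share = 100.0 / len(n_per_class)  # 100 / X
    return sum(share / n * correct[c] for c, n in n_per_class.items())


# Toy example: class 0 has 4 trials (3 correct), class 1 has 2 trials (1 correct).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = beetl_task_score(y_true, y_pred)  # 50 * 3/4 + 50 * 1/2
```

In this toy example plain accuracy would be 4/6, while the class-balanced score weights the two classes equally.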

Posted by: BEETLCompetition @ Aug. 26, 2021, 10:49 a.m.

So, as I understand it, this corresponds to Unweighted Average Recall / Balanced Accuracy. Thanks for the clarification.
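
To spell out why the two coincide, here is a short derivation using the symbols from the organizers' description (X classes, N_c trials and ACC_c correct predictions for class c); the index c is introduced here only as notation:

```latex
\[
\text{Score}
  = \sum_{c=1}^{X} \frac{100/X}{N_c}\,\mathrm{ACC}_c
  = 100 \cdot \frac{1}{X} \sum_{c=1}^{X} \frac{\mathrm{ACC}_c}{N_c}
\]
```

Since ACC_c / N_c is the recall of class c, the score is 100 times the unweighted mean of the per-class recalls.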

Posted by: StylianosBakas @ Aug. 26, 2021, 10:57 a.m.

Yes, my pleasure. So the score is balanced among the X classes, while the inverse factor '1/N1' ensures that a class with more labels doesn't dominate the evaluation of a model. In sleep, for example, one stage may have far more trials than the others, so each of its trials would carry less weight.
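
A small sketch of that effect. The stage names and trial counts below are hypothetical and chosen only for illustration; they are not BEETL data:

```python
# Hypothetical trial counts per sleep stage (illustrative only).
trials_per_class = {"wake": 100, "light": 400, "deep": 50, "rem": 40}

share = 100.0 / len(trials_per_class)  # 100 / X points available per class

# Points earned for ONE correctly predicted trial of each class: (100 / X) / N_c
points_per_trial = {c: share / n for c, n in trials_per_class.items()}

# Maximum contribution of each class: identical, however many trials it has.
max_per_class = {c: points_per_trial[c] * n for c, n in trials_per_class.items()}

for c, n in trials_per_class.items():
    print(f"{c:>5}: {n:4d} trials, {points_per_trial[c]:.4f} points per correct trial")
```

With these counts, a correct trial of the largest class is worth a tenth of a correct trial of the smallest class, yet every class can contribute the same 25 points.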

Xiaoxi

Posted by: BEETLCompetition @ Aug. 26, 2021, 11:03 a.m.