ECCV 2020 ChaLearn Looking at People Fair Face Recognition challenge Forum


> Clarification on accuracy/bias calculation

Can we get more information on accuracy/bias calculation?

For accuracy calculation, I'm guessing that the true answers are 0 (different people) or 1 (same people) while our confidence scores are normalized into [0, 1] and interpreted as probabilities for AUC-ROC calculation.

For bias calculation, however, it's unclear to me how 'accuracy' (as used in the bias definition) is calculated. Isn't AUC-ROC not calculable if the true labels are homogeneous (only 0's, only 1's)? That is, how is bias calculated for positive pairs and negative pairs? Is 'accuracy' (as used in the bias definition) the average of the confidence scores?

If there is no test pair for a particular combination of attributes, is d_a(X=x) treated as NaN and ignored in the bias calculation?

Lastly, how is bias calculated for negative pairs when the two identities may have different attributes (gender, skin color, glasses, size, etc.)? Is it calculated just based on TEMPLATE_ID1? Or is bias calculated separately, one based on TEMPLATE_ID1 and another TEMPLATE_ID2 and averaged at the end? More explanation would be greatly appreciated.

Posted by: suhk @ April 23, 2020, 9:59 p.m.

Hi suhk,

I will answer each question in turn:

>> For accuracy calculation, I'm guessing that the true answers are 0 (different people) or 1 (same people) while our confidence scores are normalized into [0, 1] and interpreted as probabilities for AUC-ROC calculation.
In fact, the submission is expected to contain confidence scores: the higher the score, the higher the confidence that the image pair contains the same person. So scores normalized to [0, 1] will work (a more detailed description is at http://chalearnlap.cvc.uab.es/challenge/38/description/ in the Starting-kit section).

>> For bias calculation, however, it's unclear to me how 'accuracy' (as used in the bias definition) is calculated. Isn't AUC-ROC not calculable if the true labels are homogeneous (only 0's, only 1's)? That is, how is bias calculated for positive pairs and negative pairs? Is 'accuracy' (as used in the bias definition) the average of the confidence scores?
This is related to the previous answer - since the expected format is confidence scores, AUC-ROC can be (and is) used as the measure of accuracy.

>> If there is no test pair for a particular combination of attributes, is d_a(X=x) treated as NaN and ignored in the bias calculation?
It is as you write: such combinations are ignored. More precisely, a combination of (legitimate) attributes X is considered only if there are at least 10 pairs for every combination of protected attributes A (so a minimum of 40 pairs in total).
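As a rough sketch (not the organizers' actual script), this filtering rule could be implemented as follows; the data layout, a list of (legitimate_combo, protected_group) tuples, is a hypothetical illustration:

```python
from collections import Counter

def valid_legitimate_combinations(pairs, min_pairs=10, protected_groups=range(4)):
    """Keep a legitimate-attribute combination X only if every protected
    group A0-A3 has at least `min_pairs` pairs with that combination."""
    counts = Counter(pairs)  # (legitimate_combo, protected_group) -> pair count
    combos = {x for x, _ in counts}
    # Counter returns 0 for missing keys, so absent groups fail the check.
    return {x for x in combos
            if all(counts[(x, a)] >= min_pairs for a in protected_groups)}
```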

>> Lastly, how is bias calculated for negative pairs when the two identities may have different attributes (gender, skin color, glasses, size, etc.)? Is it calculated just based on TEMPLATE_ID1? Or is bias calculated separately, one based on TEMPLATE_ID1 and another TEMPLATE_ID2 and averaged at the end? More explanation would be greatly appreciated.
The validation and testing pairs were generated specifically with this in mind: two images can form a pair only if their protected attributes (gender and skin colour) are the same (for positive pairs this holds by definition). This was motivated by interpretability; mixed pairs wouldn't allow finding out for which group of people the algorithm works better or worse. There is no such restriction for the legitimate attributes, because glasses, head pose etc. can differ even for the same person (age too, for example if someone has an old photo in their passport).
So at a high level the calculation of bias works as follows (I'll use only age & glasses to keep it short):
1) Take all image pairs with the same legitimate attributes, e.g. age=1,glasses=0,age=2,glasses=1 (there are two images, so every attribute appears twice too).
2) Split these pairs by protected attributes: A0: gender=0,skin=0,gender=0,skin=0; A1: gender=0,skin=1,gender=0,skin=1; A2: gender=1,skin=0,gender=1,skin=0; A3: gender=1,skin=1,gender=1,skin=1.
3) Calculate the AUC-ROC for each of A0, A1, A2 and A3.
4) Calculate the discrimination for each of A0-A3, defined here as the difference in accuracy between the most accurate group and the given one.
5) Do this for all combinations of legitimate attributes (for which there is enough data) and calculate the average discrimination for each of A0-A3.
6) The bias is the difference between the most and the least discriminated of A0-A3.
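The six steps above can be sketched in code. This is a rough reconstruction, not the organizers' evaluation script: it hand-rolls AUC via the Mann-Whitney statistic and, following the clarification later in this thread, scores each group's positive pairs against all available negative pairs; all names and data layouts are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """AUC-ROC via the Mann-Whitney statistic: the probability that a random
    positive sample outranks a random negative one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def bias(pos_pairs, all_neg_scores, protected_groups, legit_combos):
    """pos_pairs: dict mapping (legitimate_combo, protected_group) to a list
    of confidence scores; all_neg_scores: scores of ALL negative pairs."""
    # Steps 1-3: per-(x, a) accuracy = AUC of the group's positives vs all negatives.
    acc = {(x, a): auc(pos_pairs[(x, a)], all_neg_scores)
           for x in legit_combos for a in protected_groups}
    # Step 4: discrimination = best group's accuracy minus group a's accuracy.
    disc = {(x, a): max(acc[(x, b)] for b in protected_groups) - acc[(x, a)]
            for x in legit_combos for a in protected_groups}
    # Step 5: average discrimination over legitimate combinations, per group.
    mean_disc = {a: sum(disc[(x, a)] for x in legit_combos) / len(legit_combos)
                 for a in protected_groups}
    # Step 6: bias = most discriminated minus least discriminated group.
    return max(mean_disc.values()) - min(mean_disc.values())
```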

Tomas

Posted by: tomass @ April 24, 2020, 10:13 a.m.

Dear Tomas,

Thank you for the detailed explanation. I have one quick follow-up question about AUC-ROC calculation.

A ROC curve requires true binary labels and continuous confidence scores, so I see how AUC-ROC can be calculated on all evaluation pairs, consisting of both positive and negative pairs. However, from what I understand, AUC-ROC cannot be calculated on only positive pairs or on only negative pairs. (To be clear, I'm fine with the continuous confidence scores part; I'm concerned about the true binary labels being all 0's or all 1's. For example, with only positive pairs there are no negative samples, so FPR and TNR are undefined regardless of the confidence scores.) So in the bias calculation, how is AUC-ROC calculated separately for positive pairs and negative pairs? I think I'm missing something.

Thank you in advance!

Posted by: suhk @ April 24, 2020, 5:19 p.m.

Hi suhk,

you are right that if the true labels were only 0 or only 1, the ROC curve would be undefined. So for bias (for example in positive pairs), for given values of the legitimate attributes and a given protected group A0-A3, the positive samples are just the pairs with these attributes, while the negative samples are ALL available negative pairs (even those with different values of the legitimate/protected attributes), so a ROC can be calculated. This decision (using all negative pairs) is motivated by how face verification systems are deployed in practice: whoever uses the system gets several thresholds and, for each of them, the TPR and FPR to expect. The thresholds are obtained from all available testing data (i.e. there are no separate thresholds for different protected groups), which is why we use all negative pairs in the bias ROCs. (For bias in negative samples the description is the same, only with positive and negative swapped.)

Tomas

Posted by: tomass @ April 25, 2020, 10:14 a.m.

Hi Tomas, just to make sure I'm understanding correctly:

1. Pairs are formed within protected groups (female-dark, female-bright, male-dark, male-bright). So, for example, no pair can contain one person from the female-dark group and another from the male-dark group.
2. Positive pairs are pairs with the same legitimate attributes (all five have to match). Positive pairs != positive matches.
3. Negative pairs are pairs with different legitimate attributes (it's fine if only one differs). Negative pairs != negative matches.

Posted by: suhk @ May 16, 2020, 9:24 p.m.

Hi suhk,

it's like this:

1. This is true, pairs are formed only within protected groups (of course there could be some annotation errors, but they should be rare).

2., 3.
A positive pair is a pair of images that contain the same person, and a negative pair is a pair of images that contain different persons.
Pairs and matches have the same meaning here, although I understand that the word "match" can lead to misunderstandings. In the original IJB-C dataset, which makes up a large portion of our dataset, "match" is used in the sense of "pair of images to be matched/compared with each other". Initially we kept this terminology but gradually shifted to the more neutral "pair"; in some texts "match" remained, which is mostly my fault :-)
And finally, the values of the legitimate attributes have no relation to whether a pair is positive or negative.

Tomas

Posted by: tomass @ May 17, 2020, 8:56 a.m.

Hi Tomas, thanks for the clarification. However, this brings me back to my previous question (which I know you answered but I'm still confused):

In positive pairs the true labels are always 1 and in negative pairs they are always 0. ROC curve cannot be defined for just positive pairs or just negative pairs, no? Then I don't see how we can calculate accuracy (AUC-ROC) separately for positive pairs and negative pairs.

Sorry about the back and forth and thanks for your prompt reply!

Posted by: suhk @ May 17, 2020, 1:38 p.m.

No problem :-)
Let's start with the overall accuracy (to make clear what I mean): we take all available positive pairs (all genders, skin colours, glasses/no glasses, etc.) plus all available negative pairs, use these to calculate a ROC, and the area under this ROC is the accuracy of the algorithm on the whole test set. Now, if we want the accuracy for some group of positive pairs (like dark-skinned females wearing glasses), we take only the positive pairs of dark-skinned females wearing glasses plus all available negative pairs (all genders, skin colours, glasses/no glasses, etc.), use these to calculate a ROC, and the area under this ROC is what we call the accuracy for that group.
The approach for groups of negative pairs is the same, but here it's easy to get lost in what negative and positive mean. In the previous example, positive pairs were used as the positive samples for the ROC and negative pairs as the negative samples. In this case, negative pairs play the role of the positive samples for the ROC and positive pairs the role of the negative samples.
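To illustrate the role swap, here is a hedged sketch with hypothetical helper names, computing AUC via the Mann-Whitney statistic. How the score axis is oriented in the swapped ROC is my assumption: here a negative pair "ranks well" when its confidence score is low.

```python
def auc(pos_scores, neg_scores):
    """AUC-ROC: probability that a random positive sample outranks a random
    negative sample (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def group_acc_positive(group_pos_scores, all_neg_scores):
    """Accuracy for a group of positive pairs: the group's positives vs ALL
    negative pairs."""
    return auc(group_pos_scores, all_neg_scores)

def group_acc_negative(group_neg_scores, all_pos_scores):
    """Accuracy for a group of negative pairs: roles swapped -- the group's
    negative pairs are the ROC 'positives', all positive pairs the ROC
    'negatives'. Negating the scores makes a LOW score on a negative pair
    count as a correct ranking (this orientation is my assumption)."""
    return auc([-s for s in group_neg_scores], [-s for s in all_pos_scores])
```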

Posted by: tomass @ May 17, 2020, 3:32 p.m.

Ah it finally makes sense! Thank you for the detailed response.
I've been enjoying thinking about the challenge. Thanks for all your and other organizers' efforts in organizing it!

Posted by: suhk @ May 17, 2020, 3:36 p.m.

Hello, I have one last question regarding the bias calculation.
Suppose in a female-bright positive pair, one image has attributes (age 0, glasses 0, bounding box 0, source 0, head pose 0) and the other has attributes (age 0, glasses 1, bounding box 0, source 0, head pose 0). Then is this pair included in the calculation of bias for both groups (female, bright, age 0, glasses 0, bounding box 0, source 0, head pose 0) and (female, bright, age 0, glasses 1, bounding box 0, source 0, head pose 0)? So every pair will be counted twice unless the legitimate attributes exactly match (in which case it will be included once)?

Posted by: suhk @ May 18, 2020, 3:27 a.m.

Hi,

the subgroups of the protected groups are actually constructed using both images in the pair. So in your example, within the group female-bright-positive there would be a subgroup a0g0b0s0h0-a0g1b0s0h0. And to prevent some pairs being counted twice and others only once, if there was another pair of images from female-bright-positive with attributes a0g1b0s0h0-a0g0b0s0h0, the images were swapped in the preprocessing so that the pair belongs to the same subgroup.
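The swap preprocessing amounts to putting the two images' legitimate-attribute tuples into a canonical order, so that (x, y) and (y, x) map to the same subgroup key. A minimal sketch with a hypothetical function name:

```python
def canonical_subgroup(attrs1, attrs2):
    """Order the two images' attribute tuples deterministically, so that
    a0g1...-a0g0... and a0g0...-a0g1... land in the same subgroup key."""
    first, second = sorted([attrs1, attrs2])
    return (first, second)
```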

Tomas

Posted by: tomass @ May 18, 2020, 10:25 a.m.

Hi, what does the a' mean in the formula in the Evaluation metric section at http://chalearnlap.cvc.uab.es/challenge/38/description/ ?

Posted by: zhangkun @ May 24, 2020, 4:16 p.m.

Hi,

in the formula d_a(X=x) = ..., both a and a' denote protected groups. a is the single protected group for which we want to calculate the discrimination (for example, women with a dark skin colour). a' is used in the max operator: max_{a'} means "find the maximum (accuracy) over all protected groups". So the whole formula can be interpreted as the difference between the group for which the algorithm works best and the group a of interest (e.g. women with a dark skin colour).
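In code form the formula reduces to a one-liner; here acc is a hypothetical dict mapping each protected group to its AUC-ROC accuracy for the fixed legitimate combination x:

```python
def discrimination(acc, a):
    """d_a(X=x) = max_{a'} acc(a') - acc(a): the accuracy gap between the
    best-served protected group and group a, for a fixed combination x."""
    return max(acc.values()) - acc[a]
```

For example, with acc = {'G0S0': 0.99, 'G0S1': 0.95}, the discrimination of G0S1 is 0.04 and that of the best group G0S0 is 0.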

Tomas

Posted by: tomass @ May 24, 2020, 9:10 p.m.

Hi, I have another question.
In the output results log:
bias_in_positive_samples: 'GENDER_SKINCOLOUR': measure_of_bias=0.020321558365837232;
'bias_in_negative_samples': 'GENDER_SKINCOLOUR': measure_of_bias=0.004895953378000998.
But in the leaderboard results, bias (positive pairs) is 0.004896 and bias (negative pairs) is 0.020322. Why?

By the way, are the GENDER and SKIN_COLOUR biases used for ranking?

=====Results=====
{'bias_in_positive_samples':
    {'GENDER':            measure_of_bias=0.004776020851682196, most_discriminated_aF_key=G1-G1, least_discriminated_aF_key=G0-G0,
     'SKIN_COLOUR':       measure_of_bias=0.01286966580039876, most_discriminated_aF_key=S1-S1, least_discriminated_aF_key=S0-S0,
     'GENDER_SKINCOLOUR': measure_of_bias=0.020321558365837232, most_discriminated_aF_key=G0S1-G0S1, least_discriminated_aF_key=G0S0-G0S0},
 'bias_in_negative_samples':
    {'GENDER':            measure_of_bias=0.001948449749331623, most_discriminated_aF_key=G1-G1, least_discriminated_aF_key=G0-G0,
     'SKIN_COLOUR':       measure_of_bias=0.001204694203280691, most_discriminated_aF_key=S1-S1, least_discriminated_aF_key=S0-S0,
     'GENDER_SKINCOLOUR': measure_of_bias=0.004895953378000998, most_discriminated_aF_key=G1S1-G1S1, least_discriminated_aF_key=G0S0-G0S0},
 'auc': 0.9682083311178858}

Posted by: zhangkun @ May 25, 2020, 12:57 a.m.

Is there any available code to help us understand the evaluation metric?

Posted by: zhaixingzi @ May 25, 2020, 4:01 a.m.

Hi,

so to the previous two posts:

Script vs. leaderboard difference: This looks like something that sneaked through our beta testing. The results of the script are correct, but the column names of the leaderboard are swapped (so the column named Bias (positive pairs) should be named Bias (negative pairs) and vice versa). Unfortunately, CodaLab has some limitations that don't allow us to fix this, so you should instead receive an email with a more detailed explanation (we also posted it here in the forum). This doesn't affect the rankings of the submitted methods, but we nevertheless hope it won't cause too much confusion.

Gender and Skin colour biases: These "aggregated" biases are not used to rank the methods, but the script outputs them anyway because they can be useful for comparison and debugging (or are just interesting :-) ). The disaggregated biases are more informative and more challenging, so they are better at exposing weaknesses of the algorithms.

Releasing the code: We plan to release the code for the evaluation metrics in the near future, after it's polished and better documented.

Tomas

Posted by: tomass @ May 25, 2020, 8:46 p.m.

Hi
how were the val pairs collected? I want to split some subject ids from the train set and build a 'val set' to measure my algorithm. But I found that the number of all pairs falls far short of 1,000,000 when I chose 614 subject ids and made sure the protected attributes are the same for each pair. So I want to know how to build a 'val set' on the same scale as the real val set (1,000,000+ pairs, 400,000+ positive pairs).

Posted by: zhaixingzi @ May 28, 2020, 8:33 a.m.

Hi,

I can give you some hints on how to approach this. In any reasonably sized subset of subject ids there will always be more negative pairs than positive pairs, and there will be more variability in the legitimate attributes of the negative pairs too, so you may make your life easier by selecting the subject ids so that the positive-pair variability is maximized. You may start by iterating over all possible positive pairs and recording each one's combination of legitimate attributes. Every subject id will have a certain number of these combinations. You can sort the subjects in descending order by this number and select those with high variability. But instead of just taking the top 614 subject ids, it's better to construct your validation set iteratively, adding one subject at a time such that it increases the variability of the val set as much as possible. It's also a good idea to ignore very rare combinations, because they can act as noise. It's a lot of data engineering, but I hope these hints are useful.

Tomas

Posted by: tomass @ May 28, 2020, 1:40 p.m.

Thanks for your reply.

“But instead of just taking the top 614 subject ids, it's better to construct your validation set iteratively adding one subject at a time, such that it increases the variability of the val set as much as possible.”
Could you give me more explanation of "instead of just taking the top 614 subject ids, it's better to construct your validation set iteratively adding one subject at a time"? I'm confused about this.
Could you also describe how the val set was built? Why are there 448,119 positive pairs and 552,672 negative pairs?
Did you choose the 614 subject ids first and then build the val set?

In the val set there are four combinations of protected attributes. Is the number of pairs the same for each combination?
How do you handle the problem that some specific combinations have many more pairs than others?

Posted by: zhaixingzi @ May 29, 2020, 7:03 a.m.

The approach I sketched was actually used to construct the validation set :-) First the subject ids were selected, then we generated all possible pairs of images of these subjects, and then selected roughly 500,000 positive and 500,000 negative pairs (as you noted, the numbers aren't exact). At this moment I can't reveal too many details about the actual validation set, because that information could be used to overfit the algorithms, so let me just give you generic answers:

It's not practical to require equal numbers of image pairs for all combinations of attributes; depending on how thorough you were, you could end up with a very small number of pairs. However, if you have 100 pairs for one combination and 10,000 for another and want to keep only some of them, it's better to drop pairs from the more numerous combination, because you lose less variability that way.

And regarding sorting subject ids by their variability in legitimate attributes: let's ignore protected attributes for the moment and suppose there are 3 subjects, of which you want to select 2 for the validation set. Suppose that if you generate all pairs from images of subject A and record their combinations of legitimate attributes, there will be three distinct ones; let's denote them c1, c2 and c3. Pairs generated from images of subject B will have combinations c2 and c3, and pairs from images of subject C will all have only c4. In more compact form:
subject A: c1, c2, c3
subject B: c2, c3
subject C: c4
If you blindly selected the subjects with the highest variability (A and B), the val set would contain only 3 different combinations, but if you selected A and C, you would get more variability in your validation set.
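The greedy construction described above can be sketched as follows. This is an illustrative sketch, not the actual selection script; the data layout (each subject mapped to the set of legitimate-attribute combinations its pairs produce) is hypothetical:

```python
def greedy_select(subjects, k):
    """Pick k subjects one at a time, each time adding the subject that
    contributes the most combinations not yet covered by the selection."""
    chosen, covered = [], set()
    while len(chosen) < k:
        best = max((s for s in subjects if s not in chosen),
                   key=lambda s: len(subjects[s] - covered))
        chosen.append(best)
        covered |= subjects[best]
    return chosen

# The example from the post: A brings c1-c3, B only repeats c2-c3, C brings c4.
subjects = {"A": {"c1", "c2", "c3"}, "B": {"c2", "c3"}, "C": {"c4"}}
```

Here greedy_select(subjects, 2) picks A first (3 new combinations), then C (1 new combination, versus 0 for B), matching the argument that A+C beats the naively sorted top-2 A+B.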

Posted by: tomass @ May 29, 2020, 3:27 p.m.

Could you tell me the number of legitimate-attribute combinations in the positive pairs and in the negative pairs?

Posted by: zhaixingzi @ June 2, 2020, 5:42 a.m.

Hi,

there are about 100 combinations in positive pairs and about 300 in negative ones.

Tomas

Posted by: tomass @ June 2, 2020, 8:39 a.m.

Hi,
Have you picked out the subject ids with the most variety? I only got 52 combinations for positive pairs in the training set.

Posted by: zhaixingzi @ June 2, 2020, 11:44 a.m.

Hi, yes, you are right, those numbers are for the validation set. The selection process was: take the most diverse 1228 subjects (20%) for the test set, from what remains take the most diverse 614 (10%) for the validation set, and use the remaining data for the training set. This makes the challenge harder, because a good algorithm has to generalize to unseen combinations of legitimate attributes.

Tomas

Posted by: tomass @ June 2, 2020, 3:02 p.m.

Is it possible to release the val set ground truth labels in advance? It's too difficult to analyze the algorithm under development without the ground truth labels.
Thank you.

Posted by: zhaixingzi @ June 10, 2020, 8:23 a.m.

Hi,

the validation ground truth will be released on 06/19/2020, as specified in the schedule: http://chalearnlap.cvc.uab.es/challenge/38/schedule/

Tomas

Posted by: tomass @ June 10, 2020, 9:26 a.m.

Do you still have the plan to release the code for the evaluation metrics?

Posted by: zhaixingzi @ June 23, 2020, 2:05 a.m.

Hi,

we decided to release the code at the end of the competition, which is 01/07.

Tomas

Posted by: tomass @ June 23, 2020, 3:14 p.m.

Hi, Do you still have the plan to release the code for the evaluation metrics?

Posted by: mengtzu.chiu @ Aug. 28, 2020, 7:11 a.m.

Hi,

sorry for this; it looks like something we should have done a long time ago. We will release the code on Monday (I will post the link here).

Tomas

Posted by: tomass @ Aug. 28, 2020, 12:45 p.m.

Hi again, the link to the evaluation script was added to the section Evaluation metric here: http://chalearnlap.cvc.uab.es/challenge/38/description/ (or the direct link is http://158.109.8.102/FairFaceRec/webpage/evaluation.zip).

Posted by: tomass @ Aug. 31, 2020, 6:55 p.m.