Predicting Generalization in Deep Learning Forum


> Problem about Mutual Information

Hi,
I am confused about the function 'mi_from_table(table)' in mutual_information.py, shown below:

def mi_from_table(table):
    total_entry = table.sum()
    a0_sum = np.sum(table, axis=1)
    a1_sum = np.sum(table, axis=0)
    a0_entropy = entropy(a0_sum)
    a1_entropy = entropy(a1_sum)
    mi = 0.0
    for i in range(table.shape[0]):
        for j in range(table.shape[1]):
            joint = table[i, j] / total_entry
            if joint == 0:
                continue
            p_i = a0_sum[i] / total_entry
            p_j = a1_sum[j] / total_entry
            mi += joint * np.log2(joint / (p_i * p_j))
    return mi, a0_entropy, a1_entropy

The input table is a sub_table of the whole_table, in the format [[a, b],[c, d]], where 'a' is the number of pairs with 'positive true label, positive prediction label', 'b': 'positive true label, negative prediction label', 'c': 'negative true label, positive prediction label', and 'd': 'negative true label, negative prediction label'.

1. For sub_tables whose pairs all have positive true labels or all have negative true labels, mi=0 no matter what is predicted. e.g., predicting every pair correctly gives table=[[100, 0],[0, 0]], yet mi=0 (this is common in the public data). What is the meaning of this? Does it effectively remove some sub_tables arbitrarily?

2. For table=[[85, 5], [5, 5]], mi = 0.09; for table=[[90, 0], [10, 0]], mi = 0. Both predict 90 pairs correctly, but they get different mi. Does this mean that, e.g., if the pairs have 90 positive true labels and 10 negative true labels, predicting the 10 negative pairs correctly is to some extent more important?

3. Maybe you can make the algorithm traverse all ordered pairs in the sub_table.
e.g., suppose there are three models m1, m2, m3.
Currently the algorithm only uses the pairs (m1, m2), (m1, m3), (m2, m3).
One possible solution is to use the pairs (m1, m2), (m2, m1), (m1, m3), (m3, m1), (m2, m3), (m3, m2).
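
The mi values quoted in points 1 and 2 can be checked with a small self-contained sketch. The function body mirrors the snippet above; the `entropy` helper here is a hypothetical stand-in (the real one lives in mutual_information.py and may differ), assumed to take a vector of counts and return Shannon entropy in bits.

```python
import numpy as np


def entropy(counts):
    # Hypothetical stand-in for the helper in mutual_information.py:
    # Shannon entropy (in bits) of a vector of counts.
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(np.sum(p * np.log2(1.0 / p)))


def mi_from_table(table):
    # Same logic as the quoted function: rows are true labels,
    # columns are prediction labels.
    table = np.asarray(table, dtype=float)
    total_entry = table.sum()
    a0_sum = table.sum(axis=1)  # true-label counts
    a1_sum = table.sum(axis=0)  # prediction-label counts
    mi = 0.0
    for i in range(table.shape[0]):
        for j in range(table.shape[1]):
            joint = table[i, j] / total_entry
            if joint == 0:
                continue
            p_i = a0_sum[i] / total_entry
            p_j = a1_sum[j] / total_entry
            mi += joint * np.log2(joint / (p_i * p_j))
    return mi, entropy(a0_sum), entropy(a1_sum)


for table in ([[85, 5], [5, 5]], [[90, 0], [10, 0]], [[100, 0], [0, 0]]):
    mi, h_true, h_pred = mi_from_table(table)
    print(table, "mi=%.3f" % mi, "H(true)=%.3f" % h_true, "H(pred)=%.3f" % h_pred)
```

Note that for table=[[100, 0],[0, 0]] both mi and the true-label entropy come out as 0, so the unnormalized value alone does not say how informative the prediction was.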

Looking forward to your reply.

Posted by: gjin @ July 28, 2020, 12:03 a.m.

Using mutual information as the metric requires normalizing by some base notion of difficulty, i.e. the entropy of the random variables themselves. mi_from_table computes the entropy but does not divide by it immediately. The division is done in the function total_mi_from_table, which captures the mutual information over all possible values the conditioning hyperparameter can take. The normalized mutual information for the case you bring up is actually 1, because the entropy of the distribution is 0 (this can be shown by taking the limit). If you are not convinced, try using the ground truth value from model_config.json as the prediction; you will see that the ground truth always has a normalized mutual information of 1 (i.e. perfectly informative).

For your second question, in some sense, yes, because mutual information measures the statistical dependency between two random variables, i.e. how much knowing one random variable informs my knowledge about the other random variable. In [[90, 0], [10, 0]], knowing that the prediction label is positive gives no information about the true label, since the prediction label is never negative. In [[85, 5], [5, 5]], knowing that the prediction label is negative tells you that the true label is split 5:5, whereas knowing that it is positive tells you the split is 85:5. I can see why this might be a bit counter-intuitive. If you have concerns about this, as always, please post them here, or send us an email should you need formatting.
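
To make the 5:5 versus 85:5 argument concrete, here is a minimal sketch that computes the distribution of the true label conditioned on each prediction label. The helper name `true_label_given_prediction` is made up for illustration and is not part of mutual_information.py; it assumes the [[a, b],[c, d]] layout described earlier in the thread (rows are true labels, columns are prediction labels).

```python
import numpy as np


def true_label_given_prediction(table):
    # Hypothetical helper: column k of the result is P(true label | prediction k).
    # A prediction label that never occurs yields NaN for its column.
    table = np.asarray(table, dtype=float)
    col_sum = table.sum(axis=0, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(col_sum > 0, table / col_sum, np.nan)


for table in ([[85, 5], [5, 5]], [[90, 0], [10, 0]]):
    arr = np.asarray(table, dtype=float)
    marginal = arr.sum(axis=1) / arr.sum()  # P(true label) before seeing the prediction
    print(table)
    print("  P(true) =", marginal)
    print("  P(true | prediction) by column =")
    print(true_label_given_prediction(table))
```

In [[90, 0], [10, 0]] the conditional distribution given a positive prediction equals the marginal (0.9, 0.1), so the prediction changes nothing. In [[85, 5], [5, 5]] the two columns differ from each other and from the marginal, which is what a non-zero mi reflects.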

For the third question, I am not sure what you mean. Could you perhaps elaborate?

Posted by: ydjiang @ July 28, 2020, 2:06 a.m.

Maybe I didn't make myself clear. Let me use some real examples from the public dataset to express my ideas.
1. Let whole_table = [[[13,0 ,0 ,3],
[23,0 ,0 ,14]],
[[23,18,29,34],
[22,63,52,30]]]
Then MI = [(0.016 + 0.0 + 0.0 + 0.065) / 4] / [(0.991 + 0.0 + 0.0 + 0.741)/4 ] = 0.047

2. Let whole_table = [[[13,0 ,0 ,3],
[23,0 ,0 ,14]],
[[23,0,0,34],
[22,81,81,80]]]
Then MI = [(0.016 + 0.0 + 0.0 + 0.065) / 4] / [(0.991 + 0.0 + 0.0 + 0.741)/4 ] = 0.047

The second and third columns have no impact on the final MI, because all of their pairs have negative true labels. The algorithm assigns the true labels according to the hyper-parameters.
e.g., for depth 6 vs depth 9, the first item in each pair has depth 6 and the second item has depth 9. If all models with depth 9 generalise better than the models with depth 6, all true labels will be negative ('-gap' in the algorithm), and this part will not influence the final MI.

Posted by: gjin @ July 28, 2020, 9:10 a.m.

Sorry, the second whole_table =
[[[13,0 ,0 ,3],
[23,0 ,0 ,14]],
[[23,0,0,34],
[22,81,81,30]]]

Posted by: gjin @ July 28, 2020, 9:23 a.m.

How did you generate these tables? I assume the 1x4 is actually 2x2 (innermost table should only have 2 columns), but then I am not sure why there is another nested array. Assuming that’s what you meant, these tables shouldn’t be generated in practice either because all inner tables should have the same number of entries.

Posted by: ydjiang @ July 28, 2020, 3:52 p.m.

whole_table =
[[[13,0 ,0 ,3],
[23,0 ,0 ,14]],
[[23,0,0,34],
[22,81,81,30]]]
is the outer table.

The inner tables should be [[13, 23], [23, 22]], [[0, 0], [0, 81]], [[0, 0], [0, 81]] and [[3,14], [34, 30]].
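
The arithmetic from the earlier post can be reproduced with the sketch below. Two things are assumptions rather than facts about the competition code: the inner table k is taken as whole_table[:, :, k] (inferred from the inner tables listed above), and the final value is taken as the average MI divided by the average true-label entropy (inferred from the worked numbers; total_mi_from_table may do this differently). The helper names `entropy_bits`, `mi_bits` and `normalized_mi` are made up for illustration.

```python
import numpy as np


def entropy_bits(counts):
    # Shannon entropy (bits) of a vector of counts; hypothetical helper.
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))


def mi_bits(table):
    # Mutual information (bits) of a 2x2 count table, skipping empty cells.
    table = np.asarray(table, dtype=float)
    joint = table / table.sum()
    p_i = joint.sum(axis=1, keepdims=True)
    p_j = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_i * p_j)[mask])))


def normalized_mi(whole_table):
    # ASSUMPTION: inner table k is whole_table[:, :, k], and the result is
    # mean(MI) / mean(entropy of the true label), as in the post's arithmetic.
    whole_table = np.asarray(whole_table, dtype=float)
    inner = [whole_table[:, :, k] for k in range(whole_table.shape[2])]
    mis = [mi_bits(t) for t in inner]
    ents = [entropy_bits(t.sum(axis=1)) for t in inner]
    mean_ent = np.mean(ents)
    # If every inner table has zero entropy the ratio is undefined (0/0).
    return float(np.mean(mis) / mean_ent) if mean_ent > 0 else float("nan")


first = [[[13, 0, 0, 3], [23, 0, 0, 14]],
         [[23, 18, 29, 34], [22, 63, 52, 30]]]
second = [[[13, 0, 0, 3], [23, 0, 0, 14]],
          [[23, 0, 0, 34], [22, 81, 81, 30]]]

print("first : %.3f" % normalized_mi(first))
print("second: %.3f" % normalized_mi(second))
```

Under these assumptions both tables give the same value, because the second and third inner tables contribute 0 to both the numerator and the denominator.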

Posted by: gjin @ July 28, 2020, 5:15 p.m.