Predicting Generalization in Deep Learning Forum


> sharpness measure performs well locally on public data but very poorly on codalab

Why is there such a large gap between the score the sharpness v2 baseline gets on the public data on my local machine and the score it gets on CodaLab? Any suggestions or ideas are much appreciated! I understand that we'd expect the measure to perform worse on CodaLab, since more models are used to generate the task predictions, but I am surprised sharpness performs so poorly (for me it does worse than simpler measures like the Jacobian norm or distance from initialization).

When I run the sharpness v2 baseline on the public data on my machine, I get the following output with a score of around 10:
solution_dir: ../public_data/reference_data
prediction_dir: sharpness_result_submission/trial1/
score_dir: sharpness_score/trial1/

['task1_v4', 'task2_v1']
Read prediction from: sharpness_result_submission/trial1/task1_v4.predict
Read model configs from: ../public_data/reference_data/task1_v4/model_configs.json
Data is valid :)
Start computing score for task1_v4
Score computation finished...
======= Set 1 (Task1_v4): mutual_information_v2(task_task1_v4_score)=0.019042830248 =======
Read prediction from: sharpness_result_submission/trial1/task2_v1.predict
Read model configs from: ../public_data/reference_data/task2_v1/model_configs.json
Data is valid :)
Start computing score for task2_v1
Score computation finished...
======= Set 2 (Task2_v1): mutual_information_v2(task_task2_v1_score)=0.178353304254 =======
Task average: 9.869806725114

However, on CodaLab I get an output with a much lower score of about 0.7:
['task4', 'task5']
Read prediction from: /home/codalab/tmpXNXIGf/run/input/res/task4.predict
Read model configs from: /home/codalab/tmpXNXIGf/run/input/ref/reference_data/task4/model_configs.json
Data is valid :)
Start computing score for task4
Score computation finished...
======= Set 1 (Task4): mutual_information_v2(task_task4_score)=0.010826408784 =======
Read prediction from: /home/codalab/tmpXNXIGf/run/input/res/task5.predict
Read model configs from: /home/codalab/tmpXNXIGf/run/input/ref/reference_data/task5/model_configs.json
Data is valid :)
Start computing score for task5
Score computation finished...
======= Set 2 (Task5): mutual_information_v2(task_task5_score)=0.003281116961 =======
Task average: 0.705376287206
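
(Side note: in both logs the task average appears to be just the mean of the two per-task mutual information scores times 100; a quick check in Python, assuming that really is how the scorer combines them:)

    # Quick sanity check on how the task average seems to be formed (assumption).
    local   = (0.019042830248 + 0.178353304254) / 2 * 100   # 9.8698..., matches 9.869806725114
    codalab = (0.010826408784 + 0.003281116961) / 2 * 100   # 0.7054..., matches 0.705376287206
    print(local, codalab)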

Posted by: anican @ Sept. 12, 2020, 10:56 p.m.

There is no relation between the tasks in the public dataset you have access to and the online private tasks (they use different architectures and different data). They are, in some sense, "not from the same distribution".
As a result, a complexity measure that works well on certain types of models can fail on others, hence the discrepancy in the scores you are observing.
We understand this is a challenging task but we are excited to see what the participants will come up with!

Posted by: ydjiang @ Sept. 13, 2020, 10:35 p.m.

ahhh ok thanks for the clarification haha :)

Posted by: anican @ Sept. 14, 2020, 1:54 a.m.

When you say different data, that means datasets other than just CIFAR-10 and SVHN, correct?

Posted by: anican @ Sept. 14, 2020, 1:58 a.m.

That is correct.

Posted by: ydjiang @ Sept. 14, 2020, 3:30 a.m.

Can we get information on the input shapes we should be able to accommodate? For example, in a complexity measure, should I be able to handle inputs with a shape other than (32, 32, 3) (3072 features total)? If so, what other shapes should we expect to run into?

Posted by: anican @ Sept. 15, 2020, 11:16 p.m.

We are still finalizing the details, but for now I would say 32x32xc, where c is either 3 for color images or 1 for black-and-white images.
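
(Purely as an illustration of how a measure could avoid hard-coding 3072 features, here is a sketch; the complexity(model, dataset) signature and the placeholder return are assumptions based on the starting kit, not official code:)

    import tensorflow as tf

    def complexity(model, dataset):
        # Read the image shape from the dataset's element spec instead of
        # assuming 32x32x3; private tasks may use single-channel images.
        image_spec = dataset.element_spec[0]                  # assumes (image, label) elements
        h, w, c = (int(d) for d in image_spec.shape[-3:])     # e.g. 32, 32, 3 or 32, 32, 1
        n_features = h * w * c                                # 3072 for RGB, 1024 for grayscale
        # ... compute the actual measure here, using h, w, c where needed ...
        return 0.0                                            # placeholder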

Posted by: ydjiang @ Sept. 15, 2020, 11:19 p.m.

Awesome, thank you for the response! Also, is there a way to get the dataset size in the form it is sent to complexity.py? I know you can iterate over the whole thing, but I'm wondering if there is a better way using the tf.data Dataset API? I looked over the documentation but wasn't able to find anything.

Posted by: anican @ Sept. 16, 2020, 12:24 a.m.

Unfortunately, I don't think the TensorFlow Dataset API comes with that functionality natively (I remember there being some technical reasons).
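
(A possible workaround, sketched under the assumption that the dataset passed to complexity.py is an ordinary tf.data.Dataset; cardinality() often just reports UNKNOWN for datasets read from disk, in which case a counting pass is the fallback:)

    import tensorflow as tf

    def dataset_size(dataset):
        # cardinality() is cheap but may return UNKNOWN_CARDINALITY for
        # datasets built from generators or TFRecord files.
        n = tf.data.experimental.cardinality(dataset)
        if n != tf.data.experimental.UNKNOWN_CARDINALITY:
            return int(n)
        # Fallback: one pass over the dataset, counting elements.
        return int(dataset.reduce(tf.constant(0, tf.int64), lambda count, _: count + 1))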

Posted by: ydjiang @ Sept. 18, 2020, 4:07 a.m.

Hi, ydjiang. When you say different architectures for the online private tasks, does that mean VGG / Network-in-Network with different hyper-parameters trained on different datasets?

Posted by: LinZhang @ Oct. 17, 2020, 12:42 p.m.

Moreover, if I understand correctly, the sharpness measure is the second measure mentioned in https://drive.google.com/file/d/1maCI6XNxlAGjtqgzo7WKlOdI4_WBmX1P/view, which is the true generalization gap with added noise. So I still don't get why it performs poorly on the private data, even if the private data includes different architectures trained on different datasets.
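
(To make sure we are talking about the same thing, here is roughly the flavor of measure I have in mind; just a sketch, where sigma, the batch size, the number of batches, and the from_logits assumption are all illustrative choices rather than the actual baseline code:)

    import tensorflow as tf

    def sharpness_like_score(model, dataset, sigma=0.01, n_batches=8):
        # How much does the average loss rise when Gaussian noise is added to
        # the weights? Illustrative only, not the official baseline.
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

        def avg_loss():
            total, count = 0.0, 0
            for x, y in dataset.batch(64).take(n_batches):    # assumes an unbatched dataset
                total += float(loss_fn(y, model(x, training=False)))
                count += 1
            return total / max(count, 1)

        clean = avg_loss()
        originals = [w.numpy() for w in model.trainable_weights]
        for w in model.trainable_weights:                     # assumes float32 weights
            w.assign_add(tf.random.normal(w.shape, stddev=sigma))
        perturbed = avg_loss()
        for w, orig in zip(model.trainable_weights, originals):
            w.assign(orig)                                    # restore original weights
        return perturbed - clean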

Posted by: LinZhang @ Oct. 17, 2020, 1:24 p.m.

No, it means completely different architectures and completely different data.
If you look at the scores on the public data broken down by task, sharpness actually did not perform well on VGG, which was not in the original paper.

Posted by: ydjiang @ Oct. 17, 2020, 5:20 p.m.

I see. Thanks!

Posted by: LinZhang @ Oct. 18, 2020, 2:27 a.m.