Hi
I'm trying to understand the behavior of the baselines on the public datasets and would like some clarification on the scoring algorithms provided:
On the sharpness baseline:
When I run the code as is, I get
======= Set 1 (Task2_v1): mutual_information_v2(task_task2_v1_score)=0.103782771584 =======
======= Set 2 (Task1_v4): mutual_information_v2(task_task1_v4_score)=0.005493230773 =======
Task average: 5.463800117850
Which suggests to me that this baseline is performing poorly on task1_v4 in the public dataset
However, if I compute standard rank correlation metrics such as Kendall's tau and Spearman's rho (from scipy.stats import kendalltau, spearmanr),
I get a very different picture for the different models if I break them down into subsets by their model number prefix (0xx, 1xx, ...); a minimal sketch of this computation follows the numbers below.
(k: kendalltau, sp: spearmanr)
task1_v4_sharpness_0xx k 0.722 sp 0.856
task1_v4_sharpness_1xx k 0.822 sp 0.931
task1_v4_sharpness_2xx k 0.485 sp 0.671
task1_v4_sharpness_5xx k 0.762 sp 0.884
task1_v4_sharpness_6xx k 0.755 sp 0.918
task1_v4_sharpness_7xx k 0.636 sp 0.825
task2_v1_sharpness_2xx k 0.748 sp 0.909
task2_v1_sharpness_6xx k 0.393 sp 0.536
task2_v1_sharpness_9xx k 0.733 sp 0.886
This, plus visual inspection, seems to indicate that the sharpness baseline metric is performing well, at least on these subsets of models.
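For reference, here is roughly how I computed the per-subset correlations above (a minimal sketch; the DataFrame column names and the grouping helper are my own conventions, not part of the starter kit):

# Sketch: per-subset rank correlations between a complexity measure and test
# accuracy. Assumes a pandas DataFrame with columns "model_id", "complexity"
# and "test_acc" (my own column names, not the starter kit's).
import pandas as pd
from scipy.stats import kendalltau, spearmanr

def rank_correlations_by_subset(df):
    df = df.copy()
    # Group models by the hundreds digit of their id, e.g. model 512 -> "5xx".
    df["subset"] = (df["model_id"] // 100).astype(str) + "xx"
    for name, grp in df.groupby("subset"):
        k, _ = kendalltau(grp["complexity"], grp["test_acc"])
        sp, _ = spearmanr(grp["complexity"], grp["test_acc"])
        print(f"{name}  k {k:.3f}  sp {sp:.3f}")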
Could you please provide some clarification on how to interpret this and, if possible, the actual baseline results for comparison, so I can check whether I have introduced a bug?
Thank you
Posted by: charlesmartin14 @ Sept. 21, 2020, 8:44 p.m.
Continuing along these lines, I have run my own baseline on the public dataset, which is just the predicted training accuracy,
i.e.: complexity = estimate_accuracy(model, batched_ds)
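(For concreteness, a minimal sketch of what that helper could look like; it assumes a tf.keras model and a batched tf.data.Dataset with integer labels, and estimate_accuracy here is my own stand-in rather than a precise copy of the starter-kit function.)

# Sketch of the trivial complexity measure above: the training accuracy
# estimated over a few batches of the training data.
import numpy as np

def estimate_accuracy(model, batched_ds, max_batches=10):
    correct, total = 0, 0
    for i, (x, y) in enumerate(batched_ds):
        if i >= max_batches:
            break
        preds = np.argmax(model.predict(x), axis=-1)
        correct += int(np.sum(preds == y.numpy()))
        total += int(y.shape[0])
    return correct / total

# complexity = estimate_accuracy(model, batched_ds)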
On this, I find even stranger results. Here, the task average score is quite high (higher than any baseline provides):
======= Set 1 (Task2_v1): mutual_information_v2(task_task2_v1_score)=0.071868229422 =======
======= Set 2 (Task1_v4): mutual_information_v2(task_task1_v4_score)=0.080814007337 =======
Task average: 7.634111837957
But the standard rank correlations are at best anti-correlated with the test accuracies
(again, also confirmed by visual inspection)
task1_v4_simple-test_0xx k -0.109 sp -0.167
task1_v4_simple-test_1xx k -0.25 sp -0.246
task1_v4_simple-test_2xx k -0.523 sp -0.691
task1_v4_simple-test_5xx k 0.00844 sp 0.147
task1_v4_simple-test_6xx k -0.324 sp -0.24
task1_v4_simple-test_7xx k -0.595 sp -0.708
task2_v1_simple-test_2xx k -0.238 sp -0.358
task2_v1_simple-test_6xx k 0.0404 sp -0.0052
task2_v1_simple-test_9xx k -0.733 sp -0.886
It seems rather odd to have a very high task average in this case, and a bit pointless to have
baseline models that perform worse than the most trivial metric, the training accuracy.
Could you kindly shed some light on this?
Posted by: charlesmartin14 @ Sept. 21, 2020, 8:55 p.m.
The numbers look reasonable, I think.
This is a question of correlation vs causation. You can refer to the doc or the scoring program for the exact kind of metric we are using.
As to why Kendall tau can be a misleading score, here is a thought experiment that might help:
Suppose there exists a measure that perfectly captures the depth of the network while producing random predictions if two networks have the same depth.
This measure would do reasonably well in terms of Kendall tau, because deeper models usually generalize better.
Now suppose the dataset only contains models of 3 different depths. What would the Kendall tau be?
It turns out the coefficient would be around 0.33, give or take, but would you say this is a good predictor of generalization?
For your second part, the metric we are using is actually agnostic to the direction of the correlation.
It only cares about the *information* content the measure carries about generalization.
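As a toy illustration (this is not the actual scoring program; see the doc for that): if you compare the pairwise orderings induced by the measure and by generalization, a perfectly anti-correlated measure carries exactly as much information as a perfectly correlated one.

# Toy illustration (not the competition's scoring code): mutual information
# between pairwise orderings is unchanged when the measure is negated.
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def pairwise_mi(complexity, gen_gap):
    c_sign, g_sign = [], []
    for i, j in combinations(range(len(complexity)), 2):
        c_sign.append(int(np.sign(complexity[i] - complexity[j])))
        g_sign.append(int(np.sign(gen_gap[i] - gen_gap[j])))
    return mutual_info_score(c_sign, g_sign)

g = np.random.rand(50)
print(pairwise_mi(g, g))    # perfectly correlated measure
print(pairwise_mi(-g, g))   # perfectly anti-correlated: same score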
Thanks for the feedback.
The contest also states that the models provided have been trained to 100% accuracy
This does not appear to be the case. Perhaps I made an error in my evaluations.
The paper you cite
https://arxiv.org/pdf/1912.02178.pdf
indicates that the sharpness is a good measure of generalization
However, a simple test shows that the training error itself is a
better predictor, using just your IC metric.
Could you please clarify this point: are the models all trained
to 100% training accuracy, both in the public dataset and the
contest, or is there some variation here?
The point being that, if one only considers the IC task average as a metric and does not look at the simpler rank correlation metrics,
the results appear to be completely spurious.
It would seem, at first glance, that the sharpness metric is nothing more than a crude, smoothed batch estimate of the training error itself,
as opposed to actually showing a solidly correlated relationship.
This is my question: why is the IC metric so different for two totally different cases?
Posted by: charlesmartin14 @ Sept. 21, 2020, 9:28 p.m.
Which is why it is important to know whether the models are trained to 100% accuracy or to something like 99.99% on a dataset of 50000 examples.
Could you please clarify this?
Yes, there is some variation there. Apologies for the confusion.
We did mean to say almost 100% accuracy in the document, but we missed that part (the section above it does say almost 100%), which we will correct in a later version.
It turns out that training hundreds of arbitrary models to 100 percent accuracy is quite difficult, and some models just take forever to converge.
But I think they should be above 99%.
Another difficulty is that for them to be truly at the same cross-entropy, one would need to evaluate after every update, which is computationally impractical.
Sharpness is a good predictor for the set of models and data we considered in that paper, which is essentially similar to task2, on which it does well.
In other words, on that family of models, sharpness is a very good predictor, but not on task 1 (which we unfortunately did not have when we wrote the paper).
This is one of the reasons we are expanding the data to a much broader class of models and datasets, hoping to find a more general complexity measure.
The IC metric is an extremely harsh metric that holds every complexity measure to a very high standard.
Indeed, we could have taken the minimum over all tasks, but we decided that it might be too harsh for the purposes of the competition.
Thank you for the feedback.
If the models are not trained to exactly 100% accuracy, then the winning solution may simply be looking at how the specific examples are misclassified and projecting off that... which is interesting, and perhaps very useful, but is a completely different problem than asking why things actually generalize.
Posted by: charlesmartin14 @ Sept. 21, 2020, 9:49 p.m.
As long as solutions are within the guidelines of the competition, anything is fair game :)
However, internally we used the test accuracy / error on the private set, and found that it was not a very good predictor.
In fact it's worse than some of the baselines we released.
Gotcha. That is helpful
I listed my results above for the baseline... I'm not doing anything fancy, but I am finding that the test error is a reasonable predictor, using your measure of IC only.
Moreover, I find that I can easily beat the sharpness baseline and reproduce the training-error baseline IC
using any reasonable method to smooth the W matrices (e.g. TruncatedSVD),
and with a little smoothing it is easy to get the Kendall tau and Spearman metrics up as well.
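Roughly what I mean by smoothing, as a sketch only (the helper name, the rank of 15 and the norm at the end are placeholder choices of mine, not a precise description of my submission):

# Sketch: smooth a layer's weight matrix with a rank-15 truncated SVD and
# take a norm of the reconstruction; the per-layer values would then be
# aggregated into a single complexity score.
import numpy as np
from sklearn.decomposition import TruncatedSVD

def smoothed_norm(W, rank=15):
    # Flatten conv kernels to 2-D (k*k*c_in, c_out); dense layers are 2-D already.
    W2d = W.reshape(-1, W.shape[-1])
    rank = min(rank, min(W2d.shape) - 1)
    svd = TruncatedSVD(n_components=rank)
    projected = svd.fit_transform(W2d)       # shape (rows, rank)
    W_smooth = projected @ svd.components_   # low-rank reconstruction
    return np.linalg.norm(W_smooth)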
Here's an example of the resulting scores:
======= Set 1 (Task2_v1): mutual_information_v2(task_task2_v1_score)=0.075785200319 =======
======= Set 2 (Task1_v4): mutual_information_v2(task_task1_v4_score)=0.078060726118 =======
Task average: 7.692296321874
task1_v4_svd15_0xx k 0.618 sp 0.796
task1_v4_svd15_1xx k 0.848 sp 0.944
task1_v4_svd15_2xx k 0.758 sp 0.909
task1_v4_svd15_5xx k 0.583 sp 0.721
task1_v4_svd15_6xx k 0.6 sp 0.768
task1_v4_svd15_7xx k 0.697 sp 0.846
task2_v1_svd15_2xx k 0.817 sp 0.936
task2_v1_svd15_6xx k 0.216 sp 0.331
task2_v1_svd15_9xx k 0.467 sp 0.714
I'll submit to the contest because I'd like to see how it performs on your other models
It is an interesting contest and I appreciate the feedback here.
Posted by: charlesmartin14 @ Sept. 21, 2020, 10:06 p.m.