I believe this contest uses the normalized mutual information metric from the paper "Fantastic Generalization Measures and Where to Find Them". What I find puzzling is why causation is even an issue. Let's call "g" the true generalization value, and "g_hat" the generalization metric people are trying to come up with in this contest. We are trying to discover a function that takes as input all hyperparameters and properties of the data and "mimics" the output of "g", since we don't know how "g" works. We are attempting to reconstruct it.
The paper introduces a causal relationship such that "g_hat" causes "g". This is nonsensical. The only way "g_hat" can cause "g" to change is if "g_hat":
(1) Modifies the neural network architecture itself
(2) Modifies the network parameter values
Neither will happen in practice. The correct causal structure does not need to be "discovered"; it's simple:
- All hyperparameters are nodes without parents. They BOTH point (in general) to "g" and "g_hat". (In general, because "g_hat" might only be a function of some hyperparameters, not all)
- Both "g" and "g_hat" point to "tau", where tau = tau(g, g_hat) is some type of performance metric (e.g., Kendall's tau)
That's it! Now, the concern that some metrics do better than others for certain hyperparameters is a legitimate concern. That's why we need to marginalize out all hyperparameters. That is:
E[tau] = sum_(all possible hyperparameter subsets "Hs") E[tau | Hs] P(Hs)

Posted by: rjohnson0186 @ Sept. 26, 2020, 4:29 p.m.
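A minimal sketch of the marginalization above (all names and numbers are hypothetical, not from the contest code): group trained models by hyperparameter setting "Hs", compute Kendall's tau between g_hat and g inside each group, and average the per-group taus under a uniform P(Hs).

```python
from collections import defaultdict
from itertools import combinations

def kendall_tau(xs, ys):
    """Plain Kendall's tau: (concordant - discordant) / total pairs (no tie handling)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    total = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total

def expected_tau(models):
    """models: list of (hyperparam_setting, g_hat, g) triples.
    Groups models by hyperparameter setting, computes tau between
    g_hat and g within each group, then averages the per-group taus
    (i.e., uniform P(Hs))."""
    groups = defaultdict(list)
    for hs, g_hat, g in models:
        groups[hs].append((g_hat, g))
    taus = [kendall_tau([p[0] for p in ps], [p[1] for p in ps])
            for ps in groups.values()]
    return sum(taus) / len(taus)

# Toy data: two hyperparameter settings, three models each
models = [("lr=0.1", 1.0, 0.9), ("lr=0.1", 2.0, 1.1), ("lr=0.1", 3.0, 1.5),
          ("lr=0.01", 0.5, 0.4), ("lr=0.01", 1.5, 0.2), ("lr=0.01", 2.5, 0.6)]
print(expected_tau(models))  # average of per-group taus, here (1 + 1/3)/2
```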
I think it's better if you think about causation in a statistical sense rather than a physical sense.
In general it is possible for "g_hat" to modify the network and "cause" generalization, and that is in a sense what regularization is doing.
A regularizer like the l_2 norm captures a form of complexity of the hypothesis class, and the hope is that minimizing it would improve generalization.
In principle, after you find a good complexity you can also try to intervene on that complexity to improve your classifier.
In a perfect world, all one needs to do is to intervene on the complexity but unfortunately this is not always possible because the complexity might be hard to optimize with gradients.
Nonetheless, it is interesting to know what that complexity is which will further our understanding of generalization.
Re: causal structure
Suppose I have a node that is true g plus some gaussian noise. Conditioning on this node, any hyperparameter would not give more information, which means this node is causal statistically but not necessarily in a physical sense.
What we are interested in is whether such a node exists without looking at the test set.
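A minimal linear-Gaussian sketch of the point above (every variable is hypothetical): h plays the role of a hyperparameter, g = 2h + noise is true generalization, and c = g + small noise is the "statistically causal" node. The raw correlation between h and g is strong, but conditioning on c (here via partial correlation) leaves h with essentially no extra information about g.

```python
import math
import random

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly conditioning on zs."""
    rxy, rxz, ryz = corr(xs, ys), corr(xs, zs), corr(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = random.Random(0)
h = [rng.gauss(0, 1) for _ in range(5000)]      # "hyperparameter"
g = [2 * hi + rng.gauss(0, 0.5) for hi in h]    # true generalization
c = [gi + rng.gauss(0, 0.01) for gi in g]       # candidate node: g + small noise

print(corr(h, g))             # strong: h clearly relates to g
print(partial_corr(h, g, c))  # near zero: given c, h adds ~no information
```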
Re: other metric
The scheme that you describe is basically what the mutual information metric does, except the marginalization is replaced with an infimum, which focuses on the worst-performing hyperparameter set.
The philosophy here is that a good complexity should work for any hyperparameter set.
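A toy contrast between the two aggregations (the tau values are made up): a measure that ranks well in most hyperparameter groups but fails in one still looks decent on average, while the infimum is dominated by the failing group.

```python
def expected_vs_worst(group_taus):
    """group_taus: per-hyperparameter-group Kendall's tau of some measure.
    Returns (expectation, infimum): averaging can hide a bad group,
    the infimum is dominated by it."""
    return sum(group_taus) / len(group_taus), min(group_taus)

# Good in three groups, anti-correlated when (say) only depth varies:
expectation, infimum = expected_vs_worst([0.9, 0.85, 0.8, -0.3])
print(expectation)  # ≈ 0.56: looks fine on average
print(infimum)      # -0.3: the worst case exposes the failure
```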
Thanks for the quick response!
A few points:
"I think it's better if you think about causation in a statistical sense rather than a physical sense."
This is literally what Pearl has been arguing against for decades. Causation goes _outside_ the data. The simple example Pearl always gives is Simpson's paradox. Depending on what variables you condition on, you reach different conclusions. Pearl's conclusion: causality is beyond statistics (just as a side note, I do think one day we will have machines robustly infer causal structure...but we are not there yet)
The physical sense IS the way we should be thinking about causality. It's all about interventional probability distributions, not observational ones (i.e., E[X | do(Y)] vs E[X | Y]). That is, what if I physically set the variable Y (this destroys all incoming arrows into Y), and then see how the change propagates downstream?
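A toy simulation of that distinction (structure and numbers entirely hypothetical): in an SCM where a confounder Z drives both X and Y, the observational average E[Y | X=1] is inflated, while the interventional E[Y | do(X=1)], obtained by severing the arrow into X, recovers the direct effect.

```python
import random

rng = random.Random(1)

def sample(n, do_x=None):
    """Toy SCM with a confounder: Z -> X, Z -> Y, and X -> Y.
    Passing do_x severs the Z -> X arrow (a physical intervention)."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.random() < 0.5                      # confounder
        if do_x is None:
            x = rng.random() < (0.9 if z else 0.1)  # Z pushes X
        else:
            x = do_x                                 # incoming arrow destroyed
        y = 1.0 * x + 2.0 * z + rng.gauss(0, 0.1)
        xs.append(x)
        ys.append(y)
    return xs, ys

# Observational E[Y | X=1]: X=1 mostly co-occurs with Z=1, so ≈ 1 + 2*0.9 = 2.8
xs, ys = sample(20000)
obs = sum(y for x, y in zip(xs, ys) if x) / sum(xs)

# Interventional E[Y | do(X=1)]: Z keeps its prior, so ≈ 1 + 2*0.5 = 2.0
_, ys_do = sample(20000, do_x=True)
interv = sum(ys_do) / len(ys_do)
print(obs, interv)
```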
"In general it is possible for "g_hat" to modify the network and "cause" generalization, and that is in a sense what regularization is doing."
But in this contest, we take "g_hat" to be a function of what hyperparameters the network used, what data was used, etc. This is all GIVEN trained model(s). So the regularization is completed. How can a post-hoc measurement change this? I must be missing something here.
"The philosophy here is a good complexity should work for any hyperparameter sets."
Yes, I definitely agree with this.

Posted by: rjohnson0186 @ Sept. 26, 2020, 5:49 p.m.
You can interpret it this way: the assumption we make here is that none of the hyperparameters is the cause of generalization, and that there is some quantity, affected by all of them, that is.
I am not sure if I understood your second comment.
The regularization is done, but we do not know what quantity (the single quantity above) was being regularized that caused the generalization to change.
We are essentially trying to propose candidates for this quantity and measure if it's a good one via the conditional independence test (since we have every model in the universe we constrained the models to be in).
Do you agree that using the infimum is better than using the expectation? If so, then the metric is basically what you suggested, with the exception that it looks at the worst-case performance and is agnostic to the direction of correlation.

Posted by: ydjiang @ Sept. 26, 2020, 6:14 p.m.
I think it's important to note that by construction of the data the only source of randomness is the choice of the hyperparameter.
There is no confounder on these hyperparameters because we are setting them. Every hyperparameter is equally represented in the sample, so I don't think Simpson's paradox can happen here.
One way I think about it is that we are interested in whether these hyperparameters themselves are confounders for the complexity and generalization.
Since the hyperparameters are fully observed, we can directly measure the effect of the complexity by conditioning on the hyperparameters.
"Do you agree that using the infimum is better than using the expectation? If so, then the metric is basically what you suggested, with the exception that it looks at the worst-case performance and is agnostic to the direction of correlation."
Gotcha. Yeah, I think I can agree with this. Average performance vs. worst case. Being judged on the worst case is probably fine. Thanks for clarifying this.

Posted by: rjohnson0186 @ Sept. 26, 2020, 6:53 p.m.
"I think it's important to note that by construction of the data the only source of randomness is the choice of the hyperparameter. There is no confounder on these hyperparameters because we are setting them. Every hyperparameter is equally represented in the sample, so I don't think Simpson's paradox can happen here."
100% agree here. Also, I didn't mean to imply Simpson's paradox is a problem in this contest. I was just using it as an example of why causality requires knowledge of the data-generating process (aside from some attempts to do this with observational data alone... like looking at conditional independences in the data).
By the way, thanks for taking the time to respond. And regarding the point that hyperparameters don't directly cause generalization (i.e., you hypothesize there must be an intermediary variable): I will think about this for a bit. Thanks for clarifying.

Posted by: rjohnson0186 @ Sept. 26, 2020, 6:58 p.m.