Management of not fitted estimator #139
I think that we should fit the model, since the CPI model itself gets an (X, y) pair for fitting.
For CPI, there are two models: the estimator and the imputation model. In the current implementation, the imputation model is fitted internally, but the estimator must already be fitted. So the question only concerns the estimator, and which data it should be fitted on. In the case of CPI, only the training data is passed to the fit method, which can bias the estimator, because we normally assume that the estimator is fitted on both the train and test data sets.
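To make the two-model structure concrete, here is a minimal sketch assuming a scikit-learn-style API; the function name `fit_cpi` and its signature are illustrative, not the library's actual interface:

```python
import numpy as np
from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def fit_cpi(estimator, imputation_model, X_train, j):
    """Illustrative sketch: the imputation model is fitted inside CPI,
    while the predictive estimator must already be fitted by the user."""
    try:
        check_is_fitted(estimator)
    except NotFittedError:
        raise ValueError("The estimator must be fitted before calling CPI.")
    # Fit the imputation model: predict feature j from the other features.
    X_minus_j = np.delete(X_train, j, axis=1)
    imputer_j = clone(imputation_model).fit(X_minus_j, X_train[:, j])
    return imputer_j
```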
No, the estimator should always be trained on the training data only.
After some thinking, I realise that we should not require y, the target data. What do you think?
Not sure what you mean.
No, if we consider that the estimator needs to be fitted before using any of the methods, then it is not necessary to require the (X, y) pair, because we can recover predictions for y from X with the estimator.
I see that you do not need the target y for the imputation model. Nevertheless, you need it for the importance statistic: you need to compare the performance with and without the permuted input, and that performance is not directly given by the prediction on the original covariates. CPI is not only for explaining the output of a black-box model (variable importance), but also for explaining the relationship between the output y and the covariates (intrinsic variable importance).
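The point above can be sketched in code: both terms of the statistic are losses against y, so y cannot be dropped. This is a hedged illustration with squared-error loss; `imputer_j` and all names are assumptions, not the library's API:

```python
import numpy as np

def cpi_importance_j(estimator, imputer_j, X_test, y_test, j, rng=None):
    """Illustrative CPI statistic for feature j: compare the loss of the
    fitted estimator on the original data with its loss when feature j is
    replaced by its conditional reconstruction plus permuted residuals.
    The target y_test appears in both terms, which is why it is required."""
    rng = np.random.default_rng(rng)
    loss_orig = (y_test - estimator.predict(X_test)) ** 2
    # Conditionally permute feature j: keep what the other features explain,
    # shuffle only the residual part.
    X_minus_j = np.delete(X_test, j, axis=1)
    x_j_hat = imputer_j.predict(X_minus_j)
    residuals = X_test[:, j] - x_j_hat
    X_perm = X_test.copy()
    X_perm[:, j] = x_j_hat + rng.permutation(residuals)
    loss_perm = (y_test - estimator.predict(X_perm)) ** 2
    return np.mean(loss_perm - loss_orig)
```

An unimportant feature leaves the loss unchanged, so its statistic is near zero, while an influential feature increases the loss when conditionally permuted.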
I agree with the above point. To come back to the initial point, IMHO we should support both. I see two different use cases:
I would like to expand on the original question of whether the target, y, is necessary.
It is because even if
I don't understand what you mean by "a plug-in in the difference is nonparametric efficient". Can you elaborate?
On #139 (comment), point ii: I think this can be addressed by storing the fitted model/scores as attributes of the CPI class. Besides, the user is always free to provide a poorly fitted model. Regarding the purpose of the library, I see a point in "understanding data" in a research/exploratory context, where a user may be interested in understanding which variables influence a response y.
Some functions have the option of fitting the estimator, but that is not the case for all of them, which is the reason I opened this PR. However, this is only possible because we require the target, y, as an argument of the function. I still wonder why we need the target y as an argument.
I agree that the user can provide a poorly fitted model, but they need to be aware of it.
Yes, I agree that this is a goal of the library, but there are multiple ways of doing it. We currently only provide explanations based on estimators, which is quite limited from my point of view.
In general, the CPI wants to estimate a population quantity. Something similar happened in https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13392, where one can see that the two plug-in estimates do not coincide. Indeed,

$$\widehat{\psi}_{\mathrm{LOCO}}^{j}=\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\left[\left(\widehat{m}_{-j}(x_{i}^{-j})-y_{i}\right)^{2}-\left(\widehat{m}(x_{i})-y_{i}\right)^{2}\right]$$

is nonparametric efficient, meaning that the bias decreases at an optimal rate with the sample size, but

$$\widehat{\psi}_{\mathrm{LOCO}}^{j}=\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\left(\widehat{m}_{-j}(x_{i}^{-j})-\widehat{m}(x_{i})\right)^{2}$$

is not and needs a one-step correction. Finally, even with a good estimate
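The two plug-in estimates can be written down directly. Below is a hedged numerical sketch with toy data and linear models, purely to show that the two estimates are distinct quantities in finite samples; nothing here is the library's implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

j = 0
m = LinearRegression().fit(X_tr, y_tr)                           # full model m
m_minus_j = LinearRegression().fit(np.delete(X_tr, j, 1), y_tr)  # reduced model m_{-j}

pred_full = m.predict(X_te)
pred_red = m_minus_j.predict(np.delete(X_te, j, 1))

# Plug-in 1: difference of test losses (the nonparametric-efficient form).
psi_loss = np.mean((pred_red - y_te) ** 2 - (pred_full - y_te) ** 2)
# Plug-in 2: mean squared difference of predictions (needs a one-step correction).
psi_pred = np.mean((pred_red - pred_full) ** 2)
```

On this toy data both estimates are positive for the influential feature, but they do not coincide: expanding the square shows they differ by a cross term involving the full model's test residuals, which is exactly the term the one-step correction addresses.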
OK, from my understanding of your argument, we use the target to reduce the bias introduced by the estimator. My objection arose because I want to answer a different question than you do. However, I think your question is more interesting in general, and we should provide methods going in this direction.
Yes, indeed, that is the difference between feature importance and variable importance (or intrinsic variable importance vs. variable importance). I think it is important that the model is well fitted; otherwise, there is a misspecification (https://arxiv.org/abs/2007.04131). For your question of only interpreting the black-box model, we should use sensitivity analysis.
When the estimator is not fitted, what do we need to do?
For the moment, I mostly fit the estimator with the data provided, but I am not sure that this is good practice.
I think that we need to emit a message informing the user that the estimator is being fitted inside the function.
In my current refactoring of the code, I see differences in behaviour in how not-fitted estimators are managed. This needs to be homogenised.
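One possible homogenised behaviour could be a single helper used by every method, sketched here with sklearn utilities; the helper name `ensure_fitted`, the warning message, and the fallback policy are assumptions, not a decided API:

```python
import warnings
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def ensure_fitted(estimator, X, y, fit_if_needed=True):
    """Check whether `estimator` is fitted; optionally fit it on (X, y)
    while warning the user, so every method handles this case the same way."""
    try:
        check_is_fitted(estimator)
    except NotFittedError:
        if not fit_if_needed:
            raise
        warnings.warn(
            "The provided estimator was not fitted; fitting it on the "
            "data passed to this method.", UserWarning)
        estimator.fit(X, y)
    return estimator
```

Routing every method through one such helper would resolve the inconsistency: the user is always informed when an implicit fit happens, and strict callers can opt out with `fit_if_needed=False`.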