
Management of not fitted estimator #139

Open
lionelkusch opened this issue Jan 23, 2025 · 18 comments
Labels
coding style: question regarding formatting and declaration of functions · method implementation: question regarding methods implementations

Comments

@lionelkusch
Collaborator

When the estimator is not fitted, what do we need to do?

For the moment, I mostly fit the estimator on the data provided, but I am not sure this is good practice. I think we should at least emit a message informing the user that the estimator is being fitted inside the function.

In my current refactoring of the code, I noticed that the handling of not-fitted estimators differs between methods. This needs to be homogenised.
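One way to homogenise this, sketched below with sklearn's validation helpers, is a small wrapper that fits only when needed and warns the user. The `ensure_fitted` helper is hypothetical, not part of hidimstat:

```python
import warnings

from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def ensure_fitted(estimator, X, y):
    """Return a fitted estimator, warning when fitting happens internally.

    Hypothetical helper illustrating one homogeneous policy:
    pre-fitted estimators are used as-is; unfitted ones are cloned,
    fitted on (X, y), and the user is informed.
    """
    try:
        check_is_fitted(estimator)
        return estimator
    except NotFittedError:
        warnings.warn(
            "Estimator was not fitted; fitting it on the provided data."
        )
        return clone(estimator).fit(X, y)
```

`check_is_fitted` raises `NotFittedError` when the estimator has no fitted attributes (trailing-underscore names), which makes the policy explicit instead of silently refitting.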

@lionelkusch lionelkusch added coding style question regarding formatting and declaration of functions method implementation Question regarding methods implementations labels Jan 23, 2025
@bthirion
Contributor

I think that we should fit the model, since the CPI model itself gets an (X, y) pair for fitting.
E.g.
https://github.com/mind-inria/hidimstat/blob/main/examples/plot_variable_importance_classif.py#L170

@lionelkusch
Collaborator Author

For CPI, there are two models: the estimator and the imputation model. In the current implementation, the imputation model is fitted internally, but the estimator must already be fitted.

The question only concerns the estimator, and which data it should be fitted on. In the case of CPI, only the train data is provided to the fit method, which can bias the estimator, because we normally assume that the estimator is fitted on both the train and the test data sets.

@bthirion
Contributor

No, the estimator should always be trained on the train data only.

@lionelkusch
Collaborator Author

After some thinking, I realise that we should not ask for y, the target data.
If the aim of the library is to determine what the ML model has learned, then the target can be obtained directly from the estimator. This implies that the estimator has learned something before the different methods are used.

What do you think?
@AngelReyero @jpaillard @bthirion

@bthirion
Contributor

bthirion commented Feb 3, 2025

Not sure what you mean.

  • You need y to fit the main estimator that predicts y from X.
  • Anyhow, sklearn's API always requires providing X and y jointly, even if y is not used (using y=None is fine).

Does that answer your concern?
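The sklearn convention mentioned above can be illustrated with a minimal custom transformer (a toy sketch, not hidimstat code): y is accepted in the `fit` signature purely for API compatibility and is never used.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CenterTransformer(BaseEstimator, TransformerMixin):
    """Toy transformer centering each column on its training mean."""

    def fit(self, X, y=None):
        # y is unused; keeping it in the signature preserves compatibility
        # with Pipeline, cross_val_score, and other sklearn utilities.
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_
```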

@lionelkusch
Collaborator Author

No. If we consider that the estimator needs to be fitted before using any of the methods, then it is not necessary to require the (X, y) pair, because we can retrieve y from X with the estimator.
I agree that sklearn's API always requires providing X and y, because the core of that library is fitting estimators. In our case, we don't need to fit an estimator, we need to explain it. Consequently, the target y is not necessary and can be regenerated with the estimator.

@AngelReyero
Collaborator

I see that you do not need the target y for the imputation model. Nevertheless, you need it for the importance statistic: indeed, you need to compare the performance with and without the permuted input. The output is not directly given by the prediction on the original covariates.

The CPI is not only for explaining the output of a black box model (variable importance), but also to explain the relationship between the output y and the covariates (intrinsic variable importance).

@jpaillard
Collaborator

I agree with the above point.

To come back to the initial point, IMHO we should support both. I see 2 different use cases:

  • inspecting a model: the user wants to explain a previously fitted model --> no need to re-fit the full model that might be expensive
  • understanding data: the user wants to understand which variables in X are predictive of y. The user provides a model that presumably explains the relationship between X and y --> the model is not yet fitted and can be seen as a parameter of the inference
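The two use cases above could be served by a single resolution step, sketched here (the `resolve_estimator` helper and its return convention are hypothetical, not the actual API):

```python
from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def resolve_estimator(estimator, X_train, y_train):
    """Return (fitted_estimator, was_refitted), covering both use cases.

    - inspecting a model: a pre-fitted estimator is used as-is,
      avoiding an expensive re-fit;
    - understanding data: an unfitted estimator is cloned and fitted
      on the training split only, acting as a parameter of the inference.
    """
    try:
        check_is_fitted(estimator)
        return estimator, False
    except NotFittedError:
        return clone(estimator).fit(X_train, y_train), True
```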

@lionelkusch
Collaborator Author

I would like to expand on the original question by considering whether the target, y, is necessary.

  • In your first case, "inspecting a model", I don't see the point of having the target, y, because we can generate it with the fitted model and base all the analyses on this output. I just wonder if, theoretically, there are advantages or assumptions behind using the target rather than the prediction output of the estimator.

  • For the second case, "understanding data", in my opinion it's not the aim of this library. I see multiple weaknesses in introducing this functionality:

    1. There are multiple ways to train a model (cross-validation, hyperparameter tuning, ...).
    2. For a correct understanding, we need to provide some metrics indicating whether the model is correctly trained. In my opinion, without these metrics, we can lead the user to wrong interpretations if the model is not well fitted. Sorry for adding unnecessary files.

@AngelReyero
Collaborator

It is because even if $\mathbb{E}((m(\widetilde{X})-y)^2)-\mathbb{E}((m(X)-y)^2)=\mathbb{E}((m(\widetilde{X})-m(X))^2)$, a plug-in estimate does not necessarily give the same result. This is seen for instance for LOCO in https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13392, where a plug-in estimate of the left-hand-side difference is nonparametric efficient, but a plug-in of the right-hand side is not and needs a one-step estimate. More generally, for CPI we need to compare the performances on the test set with the permuted variable and with the original one. Therefore, to compute the performance it is necessary to have the target, which is not directly given by $\widehat{m}(X)$.

@lionelkusch
Collaborator Author

I don't understand what you mean by "a plug-in in the difference is nonparametric efficient". Can you detail it?
However, I still don't see your point, because it seems to contradict the case of a perfect estimator. What I mean is that if I push your reasoning a bit: for a perfect estimator, i.e. $y = m(X)$, we should not use CPI because we can't compare the difference in performance between the test sets.

@jpaillard
Collaborator

On #139 (comment)
i. The current implementation supports custom training strategies: for hyperparameter tuning, you can seamlessly pass a RandomizedSearchCV object, for instance, as shown in the classification example. Combining CPI with cross-validation is also illustrated and recommended in the example.

ii. I think that this can be addressed by storing the fitted model/scores as attributes of the CPI class. BTW, the user is always free to provide a poorly fitted model.

Regarding the purpose of the library, I see a point in "understanding data" in the research / exploratory context where a user can be interested in understanding which variables are influencing a response y.
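Point ii could look like the following sketch, which stores the internally fitted model and its held-out score as trailing-underscore attributes. The class and attribute names are hypothetical, not the actual CPI API:

```python
from sklearn.metrics import r2_score


class FittedImportance:
    """Sketch: keep the fitted model and its score inspectable."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X_train, y_train, X_test, y_test):
        # Fit on the training split only, then score on held-out data
        # so the user can check model quality before trusting importances.
        self.estimator_ = self.estimator.fit(X_train, y_train)
        self.score_ = r2_score(y_test, self.estimator_.predict(X_test))
        return self
```

A user who passes a poorly fitted model would see it immediately in `score_`, addressing the concern about wrong interpretations.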

@lionelkusch
Collaborator Author

The current implementation supports custom training strategies: for hyperparameter tuning, you can seamlessly pass a RandomizedSearchCV object, for instance, as shown in the classification example. Combining CPI with cross-validation is also illustrated and recommended in the example.

Some functions have the option of fitting the estimator, but this is not the case for all of them, which is the reason I opened this issue. However, this is only possible because we require the target, y, as an argument of the function. I still wonder why we need the target, y, as an argument.
Moreover, your example is not representative of all the methods, because there the fitting is done before applying the method, which is not the case everywhere. In my opinion, this should be the way of using the methods, and we shouldn't include/hide the fitting of the estimator in the method itself. I am still wondering why we should pass the target to the methods rather than generate it with the estimator.

@lionelkusch
Collaborator Author

ii. I think that this can be addressed by storing the fitted model/scores as attributes of the CPI class. BTW, the user is always free to provide a poorly fitted model.

I agree that the user can provide a poorly fitted model, but they need to be aware of it.
Storing/providing the estimator and its score can be an option, but it would increase the complexity of the code for a feature which, in my opinion, is not essential.

@lionelkusch
Collaborator Author

Regarding the purpose of the library, I see a point in "understanding data" in the research / exploratory context where a user can be interested in understanding which variables are influencing a response y.

Yes, I agree that this is a goal of the library, but there are multiple ways of doing it. We currently provide only explanations based on estimators, which is quite limited from my point of view.

@AngelReyero
Collaborator

AngelReyero commented Feb 4, 2025

I don't understand what you mean by "a plug-in in the difference is nonparametric efficient". Can you detail it? However, I still don't see your point because it seems in contradiction with using the perfect estimator. What I mean is that if I push a bit your reasoning, for a perfect estimator, i.e. $y = m(X)$, we should not use CPI because we can't compare the difference of performance between the test sets.

In general, the CPI wants to estimate $\psi_{\mathrm{TSI}}(j, P_0) := \mathbb{E}\left[\mathcal{L}\left(m(\widetilde{X}), y\right)\right] - \mathbb{E}\left[\mathcal{L}(m(X), y)\right]$, where $\mathcal{L}: \mathcal{Y}'\times\mathcal{Y}\to \mathbb{R}$ is the loss and $\widetilde{X}$ stands for the conditional sampling. The underlying distribution is unknown, so it has to be estimated with the plug-in estimate $\widehat{\psi}_{\mathrm{CPI}}^j = \frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}} \mathcal{L}\left(\widehat{m}(\widetilde{x}_i^{(j)}), y_i\right) - \mathcal{L}\left(\widehat{m}(x_i), y_i\right)$. This is not the same as $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}} \mathcal{L}\left(\widehat{m}(\widetilde{x}_i^{(j)}), \widehat{m}(x_i)\right)$. First, note that $\mathcal{Y}$ is not necessarily $\mathcal{Y}'$ (see for instance the cross-entropy loss).

Something similar happens in https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13392, where the two plug-in estimates do not coincide. Indeed, $\widehat{\psi}_{\mathrm{LOCO}}^j = \frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}} \left(\widehat{m}_{-j}(x_i^{-j}) - y_i\right)^2 - \left(\widehat{m}(x_i) - y_i\right)^2$ is nonparametric efficient, meaning that the bias decreases at an optimal rate with the sample size, but $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}} \left(\widehat{m}_{-j}(x_i^{-j}) - \widehat{m}(x_i)\right)^2$ is not and needs a one-step correction.

Finally, even with a good estimate of $m$, we do not have $y = m(X)$: even in regression we assume $y = m(X) + \epsilon$.
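A small numerical sketch of this point (toy data, squared loss, an assumed well-specified model $m$): the two plug-in estimates happen to target the same population quantity in this independent-Gaussian setup, yet they are computed differently from the data and do not return the same number.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy regression: y = m(x) + eps with m(x) = x, so even a perfect
# model does not satisfy y == m(x).
x = rng.normal(size=n)
eps = rng.normal(size=n)
y = x + eps
x_tilde = rng.normal(size=n)  # stand-in for the conditionally sampled x


def m(v):
    return v  # assume a well-fitted estimator of the regression function


# Plug-in CPI with squared loss: compares losses against the target y.
cpi = np.mean((m(x_tilde) - y) ** 2 - (m(x) - y) ** 2)

# Alternative plug-in: compares predictions directly, ignoring y.
pred_only = np.mean((m(x_tilde) - m(x)) ** 2)

# Both fluctuate around 2 here, but they are distinct estimators with
# different finite-sample behaviour (and, as for LOCO, different bias
# properties in general).
```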

@lionelkusch
Collaborator Author

OK, from my understanding of your argument, we use the target to reduce the bias introduced by the estimator.

My objection came because I want to answer a different question than you.
If I am correct, you want to answer: "What are the important variables for the data?"
In my case, I want to answer: "What does the model learn as important variables?"
Consequently, in my case, being biased by the estimator wasn't an issue.

However, I think that your question is more interesting in general, and we should provide methods going in this direction.

@AngelReyero
Collaborator

Yes, indeed, that is the difference between feature importance and variable importance (or intrinsic variable importance vs variable importance). I think it is important that the model is well fitted; otherwise, there is misspecification (https://arxiv.org/abs/2007.04131). For your question of only interpreting the black-box model, we should use sensitivity analysis.
