
Better support of binary variables in conditional sampling #126

Open
jpaillard opened this issue Jan 14, 2025 · 10 comments
Labels
enhancement New feature or request method implementation Question regarding methods implementations

Comments

@jpaillard
Collaborator

The conditional permutation step of CPI is designed for continuous variables, where the residual is intuitive to compute and shuffle. However, it is not suited to binary and ordinal variables. Using the predict_proba method of the imputation model would make more sense in that case.

I wonder if this also applies to knockoffs
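For reference, the continuous-case conditional permutation step being discussed can be sketched roughly as follows (illustrative only; variable names are hypothetical and do not match hidimstat's actual API):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
j = 0
X_minus_j = np.delete(X, j, axis=1)
X_j = X[:, j]

# Impute X_j from the remaining covariates.
imputation_model = LinearRegression().fit(X_minus_j, X_j)
X_j_hat = imputation_model.predict(X_minus_j)

# Shuffle the residuals and add them back: under the assumption that the
# residuals are i.i.d. and independent of X_{-j}, this approximates a
# sample from X_j | X_{-j}.
residual_j = X_j - X_j_hat
residual_j_perm = rng.permutation(residual_j)
X_j_perm = X_j_hat + residual_j_perm
```

For a binary column, `X_j_hat + residual_j_perm` generally produces values outside {0, 1}, which is the problem this issue raises.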

@jpaillard jpaillard added enhancement New feature or request method implementation Question regarding methods implementations labels Jan 14, 2025
@jpaillard
Collaborator Author

X_perm_j[:, :, group_ids] = X_j_hat[np.newaxis, :, :] + residual_j_perm

I suggest that at this line ⬆️ we create a disjunction ⬇️

if is_classifier(imputation_model):
    # predict_proba returns shape (n, n_classes); take the class-1 column
    X_j_hat_proba = imputation_model.predict_proba(X_minus_j)[:, 1]
    # a Bernoulli draw: NumPy exposes this as rng.binomial(1, p)
    X_perm_j[:, :, group_ids] = rng.binomial(1, X_j_hat_proba)
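A self-contained sketch of that proposed branch, assuming a binary column and a probabilistic imputation model (names such as `X_minus_j` and `imputation_model` follow the snippet above and are illustrative, not the final hidimstat code):

```python
import numpy as np
from sklearn.base import is_classifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_minus_j = rng.normal(size=(100, 2))
# A binary column correlated with X_{-j}.
X_j = (X_minus_j[:, 0] + rng.normal(size=100) > 0).astype(int)

imputation_model = LogisticRegression().fit(X_minus_j, X_j)

if is_classifier(imputation_model):
    # Conditional probability of class 1 given X_{-j}.
    X_j_hat_proba = imputation_model.predict_proba(X_minus_j)[:, 1]
    # Bernoulli sampling keeps the generated values in {0, 1}.
    X_j_perm = rng.binomial(1, X_j_hat_proba)
```

This guarantees the sampled column stays in {0, 1}, unlike adding permuted residuals to a predicted mean.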

@lionelkusch
Collaborator

In this case, it's better to move the computation of the residual (line 170) into the "if" branch and start the "if" at line 176, so that the classification and regression problems are separated.

However, your proposal changes the algorithm from a permutation algorithm to a generative algorithm for classification problems, and I am not in favour of that.
Moreover, as discussed in issue #117, CPI should not be used for classification problems because the computation of residuals is not well defined.

@jpaillard
Collaborator Author

  • I agree with the first point.
  • Regarding the second point, I was actually referring to binary covariates, not binary targets. The application could be either regression or classification. The motivation for this issue is that the conditional sampling of ($X^j | X^{(-j)}$) must ensure that the generated variable stays in { $0,1$ }.
  • If I remember correctly, A. Chamma also studied CPI in the classification case.

@lionelkusch
Collaborator

@achamma723 @bthirion

@jpaillard
Collaborator Author

also ping @AngelReyero

@AngelReyero
Collaborator

I agree with Joseph about the conditional sampling. The key point of the permutation in CPI is that, under some assumptions (the residuals are identically distributed, hence independent of $X_{-j}$), it amounts to sampling from the conditional distribution. So what we want in CPI is to compare performance when the information exclusively added by the covariate $X_j$ is hidden, and the conditional sampling proposed by @jpaillard achieves exactly that.

@bthirion
Contributor

Normally CPI should also handle classification, as long as we can compute a meaningful loss (e.g. cross-entropy).
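To make the loss side of this concrete: with a probabilistic classifier, cross-entropy (log loss) yields a well-defined importance score as the loss increase after conditional permutation. A toy sketch with hypothetical predicted probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([0, 1, 1, 0, 1])
# Hypothetical class-1 probabilities from the full model (using X_j)
proba_full = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
# ...and from the model evaluated with X_j conditionally permuted.
proba_perm = np.array([0.4, 0.6, 0.5, 0.5, 0.6])

# CPI-style importance: increase in cross-entropy when X_j's
# exclusive information is hidden.
importance_j = log_loss(y, proba_perm) - log_loss(y, proba_full)
```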

@achamma723
Contributor

Hello everyone!
@jpaillard as I recall, when first implementing the CPI conditional step, all three variable types were considered (binary, ordinal and continuous). However, a quick look at the current implementation of the CPI suggests that only the continuous part was kept (correct me if I'm wrong).

@jpaillard
Collaborator Author

Yes, that's correct. I found something similar in your code: https://github.com/achamma723/Variable_Importance/blob/3f007d75a851acba17a2ae1d067857c3e3fffa6f/BBI_package/src/BBI/compute_importance.py#L400-L442
I overlooked this point when I reformatted the code.

@lionelkusch
Collaborator

If you want to stay close to Ahmad's code, you should add two parameters:

  • one parameter to define the type of variable of each column of X and y
  • one parameter to define the type of estimator used for each type of variable

If you can, I think it's better to use NumPy's `choice` function rather than a Bernoulli draw to generate the new sample from the conditional distribution.
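A sketch of that `choice`-based alternative: sampling the label from each row of `predict_proba` generalizes beyond binary columns to any categorical or ordinal variable (all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_minus_j = rng.normal(size=(200, 2))
X_j = rng.integers(0, 3, size=200)  # a 3-category column

imputation_model = LogisticRegression().fit(X_minus_j, X_j)
proba = imputation_model.predict_proba(X_minus_j)  # shape (n, n_classes)

# One draw per row from that row's conditional class distribution.
X_j_perm = np.array(
    [rng.choice(imputation_model.classes_, p=p) for p in proba]
)
```

For a binary column this reduces to the Bernoulli draw, so the two proposals coincide in the 2-class case.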
