
Better support of binary variables in conditional sampling #126

Open
jpaillard opened this issue Jan 14, 2025 · 10 comments
Labels
enhancement New feature or request method implementation Question regarding methods implementations

Comments

@jpaillard
Collaborator

The conditional permutation step of CPI is designed for continuous variables, where the residual is intuitive to compute and shuffle. However, it is not suited to binary and ordinal variables. Using the predict_proba method of the imputation model would make more sense in that case.

I wonder if this also applies to knockoffs
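For reference, the continuous-case conditional permutation step being discussed can be sketched roughly as follows (illustrative only; variable names are hypothetical and do not match hidimstat's actual API):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
j = 0
X_minus_j = np.delete(X, j, axis=1)
X_j = X[:, j]

# Impute X_j from the remaining covariates.
imputation_model = LinearRegression().fit(X_minus_j, X_j)
X_j_hat = imputation_model.predict(X_minus_j)

# Shuffle the residuals and add them back: under the assumption that the
# residuals are i.i.d. and independent of X_{-j}, this approximates a
# sample from X_j | X_{-j}.
residual_j = X_j - X_j_hat
residual_j_perm = rng.permutation(residual_j)
X_j_perm = X_j_hat + residual_j_perm
```

For a binary column, `X_j_hat + residual_j_perm` generally produces values outside {0, 1}, which is the problem this issue raises.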

@jpaillard jpaillard added enhancement New feature or request method implementation Question regarding methods implementations labels Jan 14, 2025
@jpaillard
Collaborator Author

X_perm_j[:, :, group_ids] = X_j_hat[np.newaxis, :, :] + residual_j_perm

I suggest that at this line ⬆️ we create a disjunction ⬇️

if is_classifier(imputation_model):
    # predict_proba returns shape (n, n_classes); take the class-1 column
    X_j_hat_proba = imputation_model.predict_proba(X_minus_j)[:, 1]
    # a Bernoulli draw: NumPy exposes this as rng.binomial(1, p)
    X_perm_j[:, :, group_ids] = rng.binomial(1, X_j_hat_proba)
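A self-contained sketch of that proposed branch, assuming a binary column and a probabilistic imputation model (names such as `X_minus_j` and `imputation_model` follow the snippet above and are illustrative, not the final hidimstat code):

```python
import numpy as np
from sklearn.base import is_classifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_minus_j = rng.normal(size=(100, 2))
# A binary column correlated with X_{-j}.
X_j = (X_minus_j[:, 0] + rng.normal(size=100) > 0).astype(int)

imputation_model = LogisticRegression().fit(X_minus_j, X_j)

if is_classifier(imputation_model):
    # Conditional probability of class 1 given X_{-j}.
    X_j_hat_proba = imputation_model.predict_proba(X_minus_j)[:, 1]
    # Bernoulli sampling keeps the generated values in {0, 1}.
    X_j_perm = rng.binomial(1, X_j_hat_proba)
```

This guarantees the sampled column stays in {0, 1}, unlike adding permuted residuals to a predicted mean.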

@lionelkusch
Collaborator

In this case, it's better to move the computation of the residual (line 170) into the "if" branch and start the "if" at line 176, so that the classification and regression problems are separated.

However, your proposal changes the algorithm from a permutation algorithm to a generative algorithm for classification problems, and I am not in favour of that.
Moreover, as discussed in issue #117, CPI should not be used for classification problems because the computation of residuals is not well defined.

@jpaillard
Collaborator Author

  • I agree with the first point.
  • Regarding the second point, I was actually referring to binary covariates, not binary targets. The application could be either regression or classification. The motivation for this issue is that the conditional sampling of ($X^j | X^{(-j)}$) must ensure that the generated variable stays in { $0,1$ }.
  • If I remember correctly, A. Chamma also studied CPI in the classification case.

@lionelkusch
Collaborator

@achamma723 @bthirion

@jpaillard
Collaborator Author

also ping @AngelReyero

@AngelReyero
Collaborator

I agree with Joseph about the conditional sampling. The key point of the permutation in CPI is that, under some assumptions (the residuals are identically distributed, hence independent of $X_{-j}$), it amounts to sampling from the conditional distribution. So what we want in CPI is to compare performance when the information exclusively added by the covariate $X_j$ is hidden, and the conditional sampling proposed by @jpaillard achieves exactly that.

@bthirion
Contributor

Normally CPI should also handle classification, as long as we can compute a meaningful loss (e.g. cross-entropy).
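To make the loss side of this concrete: with a probabilistic classifier, cross-entropy (log loss) yields a well-defined importance score as the loss increase after conditional permutation. A toy sketch with hypothetical predicted probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([0, 1, 1, 0, 1])
# Hypothetical class-1 probabilities from the full model (using X_j)
proba_full = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
# ...and from the model evaluated with X_j conditionally permuted.
proba_perm = np.array([0.4, 0.6, 0.5, 0.5, 0.6])

# CPI-style importance: increase in cross-entropy when X_j's
# exclusive information is hidden.
importance_j = log_loss(y, proba_perm) - log_loss(y, proba_full)
```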

@achamma723
Contributor

Hello everyone!
@jpaillard as I recall, when first implementing the CPI conditional step, all three variable types were considered (binary, ordinal and continuous). However, a quick look at the current implementation of the CPI suggests that only the continuous part was kept (correct me if I'm wrong).

@jpaillard
Collaborator Author

Yes, that's correct. I found something similar in your code: https://github.com/achamma723/Variable_Importance/blob/3f007d75a851acba17a2ae1d067857c3e3fffa6f/BBI_package/src/BBI/compute_importance.py#L400-L442
I overlooked this point when I reformatted the code.

@lionelkusch
Collaborator

If you want to stay close to Ahmad's code, you should add two parameters:

  • one parameter to define the type of variable of each column of X and y
  • one parameter to define the type of estimator used for each type of variable

If you can, I think it's better to use NumPy's `choice` function rather than a Bernoulli draw to generate the new sample from the conditional distribution.
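A sketch of that `choice`-based alternative: sampling the label from each row of `predict_proba` generalizes beyond binary columns to any categorical or ordinal variable (all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_minus_j = rng.normal(size=(200, 2))
X_j = rng.integers(0, 3, size=200)  # a 3-category column

imputation_model = LogisticRegression().fit(X_minus_j, X_j)
proba = imputation_model.predict_proba(X_minus_j)  # shape (n, n_classes)

# One draw per row from that row's conditional class distribution.
X_j_perm = np.array(
    [rng.choice(imputation_model.classes_, p=p) for p in proba]
)
```

For a binary column this reduces to the Bernoulli draw, so the two proposals coincide in the 2-class case.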
