Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1457 (projpred newdata requiring more variables than necessary) #1459

Merged
merged 7 commits into from
Feb 15, 2023

Conversation

fweber144
Copy link
Contributor

This fixes issue #1457. See the commit messages for details. If you don't want to increase the version number, you can revert that of course.

I have also added a TODO comment in this line because I experienced a slight inconsistency while testing this fix: If weights (i.e., from resp_weights()) are missing in newdata (and listed in req_vars), no error is triggered, but such an error is triggered if numbers of trials (i.e., from resp_trials()) are missing in newdata (and listed in req_vars). Already from a conceptual point of view, I think it would be desirable to also throw an error if weights are missing. Tell me if you need a reprex for this (but I think this would be a different issue and furthermore, this is not projpred-related).

This is achieved by not ignoring `extract_y` anymore (ignoring this was already
undesirable from a conceptual point of view) and by setting `check_response`
and `req_vars` appropriately.
@paul-buerkner
Copy link
Owner

Thanks! Why should weights be required if the response is not? They are only relevant for log_lik, which requires the response as well. Posterior_predict for example does not make use of weights.

@fweber144
Copy link
Contributor Author

fweber144 commented Feb 15, 2023

Thanks! Why should weights be required if the response is not? They are only relevant for log_lik, which requires the response as well. Posterior_predict for example does not make use of weights.

I was also thinking about whether the requirement of weights could be tied to the requirement of the response, but projpred uses the weights in proj_predict() (the equivalent to posterior_predict()) in case of a binomial family (where these weights can be regarded as the numbers of trials, but projpred doesn't differentiate these).

Yesterday, this made me think about observation weights in proj_predict() in general, so I had a discussion with @avehtari because of this. He agreed that it would make sense for proj_predict() to take observation weights into account also for families other than binomial. The idea for implementing this is that a weighted dataset would need to be de-aggregated so that instead of a $S_{\text{prj}} \times N$ matrix (with $N$ denoting the number of weighted observations), proj_predict() returns an $S_{\text{prj}} \times \tilde{N}$ matrix, with $\tilde{N}$ denoting the number of de-aggregated observations. Things get a little trickier in case of non-integer observation weights. In that case, a solution could be to sample de-aggregated observations with probabilities proportional to the weights. @avehtari suggested stratified sampling for this, as implemented in the posterior package, for example. But even then, non-integer $\tilde{N}$ remains a problem that I'm not sure yet how to solve (simply round $\tilde{N}$ to the next integer?).

I think incorporating observation weights in proj_predict() and posterior_predict() makes sense, especially if you think of bayesplot::ppd_dens_overlay(). The kernel density estimate across observations that is used there should be influenced by observation weights, I think.

But honestly, I'm not very keen to implement this in projpred soon, as this is a very special case and probably hard to implement in a sensible way in terms of the UI (for example, how to allow users to identify which columns of the $S_{\text{prj}} \times \tilde{N}$ matrix belong to which columns of the $S_{\text{prj}} \times N$ matrix?). For now, I might throw an error or a warning in projpred when using proj_predict() in case of observation weights (apart from the binomial family because things seem to be correct there). But as mentioned above, proj_predict() in the binomial case prevents us from tying the requirement of weights to the requirement of the response.

@fweber144
Copy link
Contributor Author

fweber144 commented Feb 15, 2023

I have to add that for bayesplot::ppd_dens_overlay(), the original response values (y) would also have to be de-aggregated in the same way as for the predictions (y_rep).

@fweber144
Copy link
Contributor Author

I guess that rstantools::posterior_predict() and corresponding rstanarm methods would also need to be changed if we really want to go in this direction.

@paul-buerkner
Copy link
Owner

Okay, so what we need for this PR is that variables that are listed in req_vars are actually required. Fair point. I will check and edit the PR accordingly.

@paul-buerkner
Copy link
Owner

Well actually, I am not sure I want to change this stuff right now and I will have to think about whether this makes any sense at all. I will merge the PR and we can revisit the weight stuff later.

@paul-buerkner paul-buerkner merged commit c793d9f into paul-buerkner:master Feb 15, 2023
@fweber144
Copy link
Contributor Author

Thanks! Revisiting the weights stuff later is ok for me.

@fweber144 fweber144 deleted the newdata_issue branch February 15, 2023 13:53
fweber144 added a commit to fweber144/brms that referenced this pull request Feb 15, 2023
…_vars = character()`

Thus, we need to set `req_vars` to the response variable.
fweber144 added a commit to fweber144/brms that referenced this pull request Feb 15, 2023
…eq_vars = character()`

Thus, we need to set `req_vars` to the response variable.
@fweber144 fweber144 mentioned this pull request Feb 15, 2023
paul-buerkner added a commit that referenced this pull request Feb 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants