
max_rank calculation #68

Open
bnicenboim opened this issue Feb 28, 2022 · 5 comments
Labels
documentation Improvements or additions to documentation

Comments

@bnicenboim

I'm a bit confused about max_rank calculation.

I followed the example in Getting Started with SBC, and it's unclear to me how the max_rank is calculated in compute_SBC

In that example, there are 2000 samples, and thinning by default is set to 10, so I should end up with S = 200.
The max_rank = S + 1 = 201, right?

Then the help of the function says:

> The number of ranks is one plus the number of posterior samples after thinning - therefore as long as the number of samples is a "nice" number, the number of ranks usually will not be. To remedy this, you can specify ensure_num_ranks_divisor - the method will drop at most ensure_num_ranks_divisor - 1 samples to make the number of ranks divisible by ensure_num_ranks_divisor. The default 2 prevents the most annoying pathologies while discarding at most a single sample.

Then one should remove 1 sample, so that max_rank is 200, or am I wrong somewhere?
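To check my arithmetic, here is a small sketch of how I read the docs (in Python, for illustration only; not the package's actual code):

```python
# Hypothetical sketch of the thinning + ensure_num_ranks_divisor logic
# as I understand it from the help page:
n_draws = 2000 // 10            # 200 draws after default thinning
divisor = 2                     # ensure_num_ranks_divisor default
num_ranks = n_draws + 1         # 201 -- one more rank than draws
dropped = num_ranks % divisor   # drop at most divisor - 1 samples
n_draws -= dropped              # 199 draws kept
num_ranks = n_draws + 1         # 200 ranks, divisible by 2
print(n_draws, num_ranks)       # 199 200
```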

Great package by the way, it makes SBC super easy (and quite fast).

@martinmodrak
Collaborator

I think the source of the confusion is that posterior ranks start at 0, because "rank is defined as the number of draws < simulated value": the lowest rank is 0 (no draw smaller than the simulated value) and the maximum rank is D (all draws smaller than the simulated value), where D is the number of posterior draws remaining after thinning and after applying ensure_num_ranks_divisor. So the "number of ranks" is max_rank + 1.
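A minimal sketch of that definition with made-up numbers (in Python, for illustration; not the package's actual implementation):

```python
# Rank of the simulated ("true") value among the posterior draws:
# rank = number of draws strictly smaller than the simulated value.
def sbc_rank(draws, simulated_value):
    return sum(1 for d in draws if d < simulated_value)

draws = [0.1, 0.4, 0.7, 0.9]   # D = 4 posterior draws
print(sbc_rank(draws, 0.05))   # 0 -- no draw is smaller (lowest rank)
print(sbc_rank(draws, 1.00))   # 4 -- all draws are smaller, so max_rank = D
# Possible ranks are 0, 1, ..., D: that is D + 1 = max_rank + 1 distinct ranks.
```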

I see how that might be confusing. Do you see a specific place where mentioning some of that detail would have helped you understand what's happening?

It might also be possible to just move to 1-based ranks, but I currently dislike that option, as it would clash with how the ranks are defined...

@martinmodrak added the documentation label Feb 28, 2022
@bnicenboim
Author

> I see how that might be confusing. Do you see a specific place where mentioning some of that detail would have helped you understand what's happening?

Under Rank divisors?
IMHO, it would be worth adding a section explaining the columns of $stats (which I think should be the default output of printing the object; right now the output of print is extremely long and verbose). All the columns are quite straightforward and could be linked to brms/Stan, except for rank, z-score and max_rank, which deserve a couple of lines of explanation.

@bnicenboim
Author

Maybe a silly question, but what would happen if the number of samples is not a "nice" number? I have never worried about this before, and I plot a lot of histograms :)

@martinmodrak
Collaborator

One problem with an "ugly" number of samples is that it gets hard or impossible to divide your simulations equally between bins: with 100 ranks (i.e. max_rank = 99) you can choose 2, 4, 5, 10, 20, 25, 50 or 100 bins to get an exactly uniform expected distribution across bins, while with max_rank = 100 you get 101 ranks, which is prime, so only 101 bins will produce an exactly uniform distribution. In plot_rank_hist we are conservative: if some bins contain more ranks than others, the plotted confidence interval is the union of the credible intervals for both the "smaller" and "wider" bins. This can be very noticeable in cases like max_rank = 100, n_sims = 100, bins = 50, where most bins should contain 2 ranks on average while one should contain 3, making the approximate CIs much wider than with max_rank = 99, n_sims = 100, bins = 50, where all bins should contain 2 ranks on average.
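The divisibility argument can be checked directly; a throwaway sketch (in Python, for illustration; not SBC code):

```python
# Which bin counts split num_ranks ranks exactly evenly?
def uniform_bin_counts(num_ranks):
    return [b for b in range(2, num_ranks + 1) if num_ranks % b == 0]

print(uniform_bin_counts(100))  # [2, 4, 5, 10, 20, 25, 50, 100]
print(uniform_bin_counts(101))  # [101] -- 101 is prime
```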

A smaller problem also arises for the empirical_coverage function: with max_rank = 99 you can compute the coverage of any integer-percentage posterior CI exactly, which is typically what people care about (50%, 90%, 95%), while with max_rank = 100 you can only compute the coverage of some weird intervals (50.49%, 90.09%, 95.04%). This is definitely not an important difference unless you have a ton of simulations.
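The numbers above come out of simple arithmetic; a quick sketch (in Python, for illustration; not the empirical_coverage implementation):

```python
# With max_rank = 99 there are 100 ranks: achievable coverages are k/100,
# so the usual 50% / 90% / 95% intervals are exact.
print(50 / 100, 90 / 100, 95 / 100)
# With max_rank = 100 there are 101 ranks: only multiples of 1/101 are
# achievable, so the nearest intervals land slightly off the round targets.
print(round(51 / 101, 4), round(91 / 101, 4), round(96 / 101, 4))
```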

In both cases, the differences are almost always minor, it just felt good to do things "right" in the package.

@bnicenboim
Author

Understood!

> This can be very noticeable in cases like max_rank = 100, n_sims = 100, bins = 50 where most bins should contain 2 ranks on average, while one should contain 3 ranks,

But regardless of that, it would be a terrible histogram to check! This reassures me:

> In both cases, the differences are almost always minor, it just felt good to do things "right" in the package.

Anyway, thanks for the explanation :) I'm letting this get off-topic. I guess I'll think more about this for my future histograms; I've just started to wonder about all the other histograms I've made.

(BTW, I see that the decision of how many bins to use is in any case made by plot_rank_hist(), which doesn't have any description.)
