
max_rank calculation #68

Open
bnicenboim opened this issue Feb 28, 2022 · 5 comments
Labels
documentation Improvements or additions to documentation

Comments

@bnicenboim

I'm a bit confused about max_rank calculation.

I followed the example in Getting Started with SBC, and it's unclear to me how the max_rank is calculated in compute_SBC

In that example, there are 2000 samples, and thinning by default is set to 10, so I should end up with S = 200.
The max_rank = S + 1 = 201, right?

Then the help of the function says:

> The number of ranks is one plus the number of posterior samples after thinning - therefore as long as the number of samples is a "nice" number, the number of ranks usually will not be. To remedy this, you can specify ensure_num_ranks_divisor - the method will drop at most ensure_num_ranks_divisor - 1 samples to make the number of ranks divisible by ensure_num_ranks_divisor. The default 2 prevents the most annoying pathologies while discarding at most a single sample.

Then one should remove 1 sample, so that max_rank is 200, or am I wrong somewhere?
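To check my arithmetic, here is a small sketch of how I read the docs (in Python, for illustration only; not the package's actual code):

```python
# Hypothetical sketch of the thinning + ensure_num_ranks_divisor logic
# as I understand it from the help page:
n_draws = 2000 // 10            # 200 draws after default thinning
divisor = 2                     # ensure_num_ranks_divisor default
num_ranks = n_draws + 1         # 201 -- one more rank than draws
dropped = num_ranks % divisor   # drop at most divisor - 1 samples
n_draws -= dropped              # 199 draws kept
num_ranks = n_draws + 1         # 200 ranks, divisible by 2
print(n_draws, num_ranks)       # 199 200
```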

Great package by the way, it makes SBC super easy (and quite fast).

@martinmodrak
Collaborator

I think the source of the confusion is that posterior ranks start at 0, because "rank is defined as the number of draws < simulated value": the lowest rank is 0 (no draw smaller than the simulated value) and the maximum rank is D (all draws smaller than the simulated value), where D is the number of posterior draws remaining after thinning and after applying ensure_num_ranks_divisor. So the "number of ranks" is max_rank + 1.
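A minimal sketch of that definition with made-up numbers (in Python, for illustration; not the package's actual implementation):

```python
# Rank of the simulated ("true") value among the posterior draws:
# rank = number of draws strictly smaller than the simulated value.
def sbc_rank(draws, simulated_value):
    return sum(1 for d in draws if d < simulated_value)

draws = [0.1, 0.4, 0.7, 0.9]   # D = 4 posterior draws
print(sbc_rank(draws, 0.05))   # 0 -- no draw is smaller (lowest rank)
print(sbc_rank(draws, 1.00))   # 4 -- all draws are smaller, so max_rank = D
# Possible ranks are 0, 1, ..., D: that is D + 1 = max_rank + 1 distinct ranks.
```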

I see how that might be confusing. Do you see a specific place where mentioning some of that detail would have helped you understand what's happening?

It might also be possible to just move to 1-based ranks, but I currently dislike that option, as it would clash with how the ranks are defined...

@martinmodrak added the documentation label Feb 28, 2022
@bnicenboim
Author

> I see how that might be confusing. Do you see a specific place where mentioning some of that detail would have helped you understand what's happening?

Under Rank divisors?
IMHO, it would be worth adding a section explaining the columns of $stats (which I think should be the default output of printing the object; right now the output of print is extremely long and verbose). All the columns are quite straightforward and could be linked to brms/Stan, except for rank, z-score and max_rank, which deserve a couple of lines of explanation.

@bnicenboim
Author

Maybe a silly question, but what would happen if the number of samples is not a "nice" number? I have never worried about this before, and I plot a lot of histograms :)

@martinmodrak
Collaborator

One problem with an "ugly" number of samples is that it gets hard or impossible to divide your simulations equally between bins: with 100 ranks (i.e. max_rank = 99) you can choose 2, 4, 5, 10, 20, 25, 50 or 100 bins to get an exactly uniform expected distribution across bins, while with max_rank = 100 you get 101 ranks, which is prime, so only 101 bins will produce an exactly uniform distribution. In plot_rank_hist we are conservative: if some bins contain more ranks than others, the plotted confidence interval is the union of the credible intervals for both the "smaller" and "wider" bins. This can be very noticeable in cases like max_rank = 100, n_sims = 100, bins = 50, where most bins should contain 2 ranks on average while one should contain 3, making the approximate CIs much wider than with max_rank = 99, n_sims = 100, bins = 50, where all bins should contain 2 ranks on average.
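The divisibility argument can be checked directly; a throwaway sketch (in Python, for illustration; not SBC code):

```python
# Which bin counts split num_ranks ranks exactly evenly?
def uniform_bin_counts(num_ranks):
    return [b for b in range(2, num_ranks + 1) if num_ranks % b == 0]

print(uniform_bin_counts(100))  # [2, 4, 5, 10, 20, 25, 50, 100]
print(uniform_bin_counts(101))  # [101] -- 101 is prime
```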

A smaller problem also arises for the empirical_coverage function: with max_rank = 99 you can compute the coverage of any integer-percentage posterior CI exactly, which is typically what people care about (50%, 90%, 95%), while with max_rank = 100 you can only compute the coverage of some weird intervals (50.49%, 90.09%, 95.04%). This is definitely not an important difference unless you have a ton of simulations.
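The numbers above come out of simple arithmetic; a quick sketch (in Python, for illustration; not the empirical_coverage implementation):

```python
# With max_rank = 99 there are 100 ranks: achievable coverages are k/100,
# so the usual 50% / 90% / 95% intervals are exact.
print(50 / 100, 90 / 100, 95 / 100)
# With max_rank = 100 there are 101 ranks: only multiples of 1/101 are
# achievable, so the nearest intervals land slightly off the round targets.
print(round(51 / 101, 4), round(91 / 101, 4), round(96 / 101, 4))
```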

In both cases, the differences are almost always minor, it just felt good to do things "right" in the package.

@bnicenboim
Author

Understood!

> This can be very noticeable in cases like max_rank = 100, n_sims = 100, bins = 50 where most bins should contain 2 ranks on average, while one should contain 3 ranks,

But regardless of that, it would be a terrible histogram to check! This reassures me:

> In both cases, the differences are almost always minor, it just felt good to do things "right" in the package.

Anyway, thanks for the explanation :) I'm letting this get off-topic. I guess I'll think more about this for my future histograms; I've just started to wonder about all the other histograms I've made.

(BTW, I see that the decision of how many bins to use is in any case made by plot_rank_hist(), which doesn't have any description.)
