Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDI naming #2306

Open
sboukortt opened this issue Jan 13, 2024 · 7 comments
Open

HDI naming #2306

sboukortt opened this issue Jan 13, 2024 · 7 comments

Comments

@sboukortt
Copy link

sboukortt commented Jan 13, 2024

Hi,

I’m opening this because it’s not really clear why, in ArviZ’s output, the bounds for, say, a 94% HDI are named hdi_3% and hdi_97%. Such names seem to imply that they correspond respectively to the 3rd and 97th percentiles of the posterior distribution, which would make them equal-tailed credible intervals if that were the case, but it isn’t necessarily. (HDIs and equal-tailed credible intervals don’t have to coincide.) Indeed, the following program shows a counterexample:

import arviz as az
import pymc as pm
import scipy.stats

with pm.Model() as model:
    pm.Exponential('expon', scale=100)
    trace = pm.sample(10000)

summary = az.summary(trace, hdi_prob=.94)
print(summary)

hdi_lower = summary['hdi_3%'].expon
hdi_upper = summary['hdi_97%'].expon

dist = scipy.stats.expon(scale=100)

print(f"\nThe 94% HDI ([{hdi_lower:.3f}, {hdi_upper:.3f}]) contains {dist.cdf(hdi_upper) - dist.cdf(hdi_lower):.3%} of the probability.")
print(f"{dist.cdf(hdi_lower):.3%} of the probability is to the left of hdi_3%.")
print(f"{dist.sf(hdi_upper):.3%} of the probability is to the right of hdi_97%. ({dist.cdf(hdi_upper):.3%} to its left.)\n")

print(f"The equal-tailed 94% credible interval is [{dist.ppf(0.03):.3f}, {dist.ppf(0.97):.3f}].")

Here is its output:

          mean       sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
expon  100.309  100.502   0.001  281.632       0.79    0.559   10051.0    8522.0    1.0

The 94% HDI ([0.001, 281.632]) contains 94.016% of the probability.
0.001% of the probability is to the left of hdi_3%.
5.983% of the probability is to the right of hdi_97%. (94.017% to its left.)

The equal-tailed 94% credible interval is [3.046, 350.656].

Is there maybe another reason for the naming? Or should it perhaps be reconsidered?

Thanks.

@aloctavodia
Copy link
Contributor

Thanks for reporting this. Probably a better name would be hdi_lb (lower bound) and hdi_up (upper bound). Do you want to submit a PR?

@sethaxen
Copy link
Member

sethaxen commented Feb 1, 2024

A downside to the proposal of hdi_lb and hdi_ub is that the column names no longer inform which HDI was included. Would it be crazy if we have a single hdi_94% column that contains a 2-tuple or length 2 list with the bounds?

@aloctavodia
Copy link
Contributor

aloctavodia commented Feb 2, 2024

And what about something like hdi_94_lb? The downside is that is maybe too long. Having tuples save horizontal space.

@sethaxen
Copy link
Member

sethaxen commented Feb 2, 2024

In arviz, I think numbers would be shown to 3 decimal points, so if they're on unit scale, we have 4 digits plus decimal plus maybe a sign plus whitespace for 12-14 total characters for the column contents. If using a tuple, this is 14-16 characters. While hdi_94% would just take 7 characters, using column names hdi_94_lb and hdi_94_ub would take 20 characters. hdi94_lb would be better, but still, this is 2-4 characters more than the tuple.

In PosteriorStats.jl, we show 3 significant figures, so the tuple takes 12-14 characters, so I'm more strongly in favor of tuples there. Plus, table display in Julia truncates columns from the right instead of the middle like pandas, so I'm more strongly in favor of tuples in PosteriorStats. Here, I'm less certain.

@tomicapretto
Copy link
Contributor

I'm in general not in favor of storing sequences in a cell of a data frame. However, I think this case is a good exception, so I vote in favor of the tuple. Having the column name be something like hdi_{percent}% is better than the other alternatives I can think of.

@sandstromviktor
Copy link

I think storing a tuple with the interval is the most clean way to do it. Then there will be no confusion at all what this represents.

@aloctavodia
Copy link
Contributor

Storing a tuple is the most favored option. We should go with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants