Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inaccurate HDRs #123

Open
robjhyndman opened this issue Aug 12, 2024 · 5 comments
Open

Inaccurate HDRs #123

robjhyndman opened this issue Aug 12, 2024 · 5 comments

Comments

@robjhyndman
Copy link
Contributor

library(distributional)
hdr(dist_normal())
#> <hdr[1]>
#> [1] [-1.95194, 1.95194]95
qnorm(0.975)
#> [1] 1.959964

Created on 2024-08-12 with reprex v2.1.1

@mitchelloharawild
Copy link
Owner

mitchelloharawild commented Aug 12, 2024

This is largely based on the grid size of the quantiles (same as is done in hdrcde).

Gradient descent based methods would produce more accurate solutions starting from suitable hdrs from the grid method.

library(distributional)
hdr(dist_normal(), n = 1e6)
#> <hdr[1]>
#> [1] [-1.959956, 1.959956]95
qnorm(0.975)
#> [1] 1.959964

Created on 2024-08-12 with reprex v2.1.0

@robjhyndman
Copy link
Contributor Author

Maybe set n a little higher. quantiles should be relatively quick to compute for most distributions. Setting n=1e5 is still relatively quick for dist_normal() I think 4 accurate significant figures is a reasonable objective.

@mitchelloharawild
Copy link
Owner

What do you think about root finding algorithms for this, starting from a suitable location chosen from a course grid?

@robjhyndman
Copy link
Contributor Author

Maybe. Perhaps try a speed comparison on dist_normal(). I've tested the effect of grid size on accuracy, and it has some really weird numerical behaviour.

library(distributional)
library(ggplot2)
df <- data.frame(
  n = exp(seq(log(50), log(5e3), length = 1e3)),
  upper = NA_real_
)
for (i in seq(NROW(df))) {
  df$upper[i] <- unlist(hdr(dist_normal(), n = df$n[i]))["upper"]
}
df |>
  ggplot(aes(x = n, y = upper)) +
  geom_line() +
  geom_hline(yintercept = qnorm(0.975), col = "red") +
  scale_x_log10()

Created on 2024-09-17 with reprex v2.1.1

@robjhyndman
Copy link
Contributor Author

robjhyndman commented Sep 17, 2024

I started writing a root finding function, and then realised why I didn't do it this way in the first place. Every time you do a calculation of the HDR coverage for a generic continuous distribution, you need to compute the density at a large number of observations, because the algorithm uses Monte Carlo integration. So with root finding, it would require far more computations than having a fine grid.

Of course, we could have hdr.dist_xxx() functions for specific distributions that would be much faster. For symmetric distributions the computation is easily done from the cdf. So perhaps we could at least add hdr.dist_normal() and similar for other symmetric distributions, and change the value of n in hdr.dist_default() to about 5000?

I can do a PR along these lines if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants