-
Notifications
You must be signed in to change notification settings - Fork 422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kdm/univariateboundaryfix #457
base: master
Are you sure you want to change the base?
Conversation
Thanks for doing this systematically. There are quite a few changes here so it might take some time to reivew. I'll try to do it on Monday. |
@KrisDM Please remove the commits that are not related to this PR. I know it can be useful to work on top of your other branches, but it makes the review harder. |
How do I do that? I have the impression it automatically merged the first PR once I made the second PR. It's not something I did knowingly (except for the last commit, which I did in this branch because I thought they had been merged now). |
My guess is that you started out from the branch that you used for preparing the other PRs. There is probably several ways of fixing this, but I would first copy the changes to a different location. Then I'd reset this branch to origin/master, i.e.
and then add the relevant chages to the branch and then
when you are done. If someone else has a better solution, feel free to suggest it. |
OK, thanks for the tips, will get to it later tonight. |
e4ee8b0
to
834f910
Compare
@andreasnoack I cleaned up the commits, squashing only the univariate-related commits into one so if you don't see any other issues with my fix you can merge the PR straight away. I ended up doing it slightly differently from your suggestions, using git rebase -i origin/master to separate the commits, and then do a git push --force to my branch. You were right on the cause: I now know that my original error was to not start the new branch from master, but from my previously created branch where I had done the multivariate work. |
Closed-opened PR to trigger new Travis build (which had failed for an external reason). |
@KrisDM I'm wondering if we can avoid the extra complexity you add here by using a few floting point tricks. It would be useful if you could give specific examples of the issues you are seeing, but a guess could be that you are e.g. trying to avoid the function logpdf(d::Chi, x::Float64)
ν = d.ν
if insupport(d, x)
return (1.0 - 0.5 * ν) * logtwo + max((ν - 1.0) * log(x), -Inf) - 0.5 * x^2 - lgamma(0.5 * ν)
else
return -Inf
end
end |
@andreasnoack The issues were somewhat different for all distributions that I made changes to. Sometimes it was about avoiding a NaN on the boundary point of the support, as you say, but it doesn't cover all cases. I came across some domain errors been thrown as well. I thought that this was a principled solution because in the mathematical formulation of distributions, some support intervals are closed, some are open, and some are half-open. In the Julia code, however, all support intervals were regarded as closed. I don't think this adds unnecessary complexity. In terms of computations, probably none, as it's likely the compiler inlines everything away. Developers of the distributions package who are adding a new distribution with half-open or open support interval only need to define 1 macro that follows the same pattern as the existing @distr_support macro, and then not worry about what happens at the boundary points. And finally, the users of the Distributions package are entirely shielded. In contrast, the fix you suggest would have to be applied separately to each affected function, and would be different for each of them. It adds at least one and possibly more operations per function (functions that may need to be called often) and makes the code less readable. In my experience, a fix like that always comes back to bite in the future, so I'm not a fan. |
…ions. These errors were partly caused because for some distributions the boundary points of the support are not in the support itself. I solved this by allowing the support to be a closed, half-open or open interval. The default is a closed interval (which corresponds to the old behaviour), meaning that only those distributions with a half-open or open support interval needed to be ammended. The core of this functionality is in the file src/univariates.jl. For half-open or open intervals, the macro @distr_boundaries specifies which boundaries of the interval are open or closed. See e.g., in src/univariate/continuous/lognormal.jl for an example. Tests have been added to test/utils.jl and src/testutils.jl A second set of errors came from sloppy implementations in pdf, logpdf, cdf, ccdf, logcdf or logccdf functions of specific distributions that did not explicitly test if the presented values fall inside or outside the support. Tests that were added in src/testutils.jl flagged - I believe - all instances of such errors, and the errors have been fixed. I spotted 2 additional errors in the support() function in src/univariates.jl. There were no tests for this function so I added them to src/testutils.jl/test_support()
834f910
to
7ce0216
Compare
@andreasnoack I fixed the indentation which was off in two places. By chance I met the developer of Juno yesterday. I asked him to change the default to 4 spaces per tab :-). |
I'm still not sold on this change. I think this should just be handled for each definition of the special functions instead of introducing new types, functions and a macro. This package is already quite heavy in terms of compilation and I don't think it makes the code simpler to understand. If the special functions don't return I wouldn't worry about the Finally, there are some issues with the implementation here which will probably result in a serious slow down. The reason is that the comparison function ends up being determined by a runtime value. One day the compiler might be smart enough to figure out that the support symbols are constant for a specific distribution, but right now you'll see things like julia> @code_warntype insupport(Chisq(1), 2.0)
Variables:
d::Distributions.Chisq
x::Float64
Body:
begin # /Users/andreasnoack/.julia/v0.4/Distributions/src/univariates.jl, line 52:
unless ((Distributions.comparator)((top(apply_type))(Distributions.Val,:closed)::Type{_<:Val{T}})::F)(0.0,x::Float64)::Any goto 0
return ((Distributions.comparator)((top(apply_type))(Distributions.Val,:closed)::Type{_<:Val{T}})::F)(x::Float64,Distributions.Inf)::Any
0:
return false
end::Any This is one of the reasons why I was talking about complexity. |
@andreasnoack Fair point that the implementation has issues. I thought the compiler would be smart enough to figure it out. I'll see if I can fix it. In general I think that, because the root cause of the problem lies with the insupport function, it needs to be fixed inside the insupport function, and not with a custom hack to each of the calculations, which is error-prone and less maintainable. Another point: a large part of the complexity of my fix is because I wanted that the fix would be equally valid to the auxiliary RealInterval type, which isn't used anywhere inside the Distributions package. I don't see why Distributions is actually exporting this type at all. I checked a handful of downstream packages using Distributions, and have not seen any indication anyone is using it, probably because it was never documented (neither the RealInterval type, nor the function support() which returns a RealInterval, are mentioned in the documentation). So if I am allowed to retire RealInterval and support(), we can get rid of some superfluous complexity in the package, and I'll be able to fix the insupport function far more easily with a uniform fix for all distributions. What do you think? |
Sorry for getting to this so late, I've had other stuff on my plate. Thanks for looking into this, but I have to agree with @andreasnoack, I think that this seems a somewhat over-complicated solution to the problem. The whole open-closed distinction is somewhat meaningless in regards to distributions, since the only cases where it can possibly be different is at |
@simonbyrne Mmh, that's not true though. Quite a few distributions have a limited interval. E.g., the LogNormal has open support interval (0,+Inf). If I call There are several more. Another example, the Levy distribution, has open support interval (mu,+Inf), with mu a parameter of the distribution. As I said in my previous message, the solution would become a lot simpler if I am allowed to retire the undocumented but exported type RealInterval (defined at the top of univariates.jl, but used nowhere in the package). There is another Julia package implementing intervals, and it seems to me that the Distributions package shouldn't be exporting its own. |
See earlier discussion in #150 (comment) |
But I would agree that |
#150 seems to be a somewhat different discussion. Perhaps "support" has subtly different meanings in different branches of mathematics. Each Wikipedia page about a particular distribution defines the support as an open, half-open or closed interval, so it seems to be an accepted concept in mainstream probability theory. These wiki pages about distributions link to another wiki page about support than the one you posted in #150: https://en.wikipedia.org/wiki/Support_(mathematics) If I understand that page well, then, unlike what you write in #150, there is a difference between uniform distributions with support (0.0,1.0) and [0.0,1.0], as the first one should return pdf(0.0) = 0.0, and the second one would return a value > 0.0. I will propose another solution that retires RealInterval and fixes the insupport function for specific distributions whose support has open/half-open intervals (as per their wiki definition). You can then see whether you like the solution or not. If you don't like that solution then so be it, but I have no interest in hacking each pdf, logpdf, cdf, ccdf etc function separately to get rid of the incorrect results. That's not right either. |
I don't see how you get that: the page you link to refers to support of a function, not a distribution (it does mention distributions and measures further down, but points out that there are subtleties involved). |
I agree it's not clear cut, and that sometimes support is indeed referred to as a closure. But then, there seems to be a need to make a distinction between those distributions for which the boundary points return 0.0 for their pdf, and those who don't. Otherwise surely the support interval for the LogNormal wouldn't be defined as an open interval, as here: https://en.wikipedia.org/wiki/Log-normal_distribution (right-hand column) It may be that those pages are actually wrong, or otherwise it's one of the subtleties that the support page glosses over? |
I hesitate to say it, but I think the pages might be wrong. They seem to be treating as "the support of the pdf", but pdfs themselves are only uniquely defined up to null sets. In other words, we can set the value of the pdf arbitrarily at the boundary points, and not change the underlying distribution. There are also more practical reasons for treating the intervals as closed. In the uniform (0,1) case, we would have |
You may be right. On further reading, it seems there is a mixup of "support of the distribution" (the set of values for which PR(x=X) > 0) and the "domain of the pdf" (the values for which the function is defined). For some distributions these are the same, and for others they are not. That mixup is probably what gives rise to the confusion on the wiki pages, but it's also visible in the Julia code itself, whenever the insupport function is used inside the pdf calculation to determine what actually should be the domain. E.g., for the LogNormal, the support is [0,+Inf], whereas the domain is x > 0.0. So to be correct, we actually should not rely on the insupport function inside the calculation of the pdf, but do domain boundary checks which may or may not be different from what the insupport function returns. |
In other words, it seems that pdf(LogNormal(),0.0) or pdf(LogNormal(),-1.0) should not return 0.0, but throw a DomainError? |
Not quite, the support isn't I think the general consensus we have reached is that we should avoid |
You're right, I meant pdf(x) > 0.0. I'll prep a fix when I have time. It seems the most valuable thing I did in the last fix was write the tests that expose the boundary errors. In the fix I propose not relying on insupport inside the calculations for which support of the distribution and the domain of its pdf are different (the first one always being a closed interval if we follow the strict definition of support as closure, the second one not necessarily) Performing the correct boundary tests would be better than fiddling with the calculations. |
what's the status on this one? It needs rebasing from master plus the fixes you mentioned in your last comment @KrisDM |
ping @KrisDM :) |
I'm very sorry but I don't have the time at the moment to look at this. Is
there anything I need to do to "unblock" the issue, i.e., remove a commit
or pull request or anything else? I'm not sure anymore where I left this
two years ago.
Best,
Kris
*---*
*How we can become convinced we're right, even when we're wrong.*
*Web*: http://www.rightbetween.com
*Facebook*: http://facebook.com/rightbtw
*Twitter*: http://twitter.com/@rightbtw
---
Aniku Ltd, London, United Kingdom
Registered in England and Wales, No: 7213769
…On Thu, 28 Mar 2019 at 19:50, Mathieu Besançon ***@***.***> wrote:
ping @KrisDM <https://github.com/KrisDM> :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#457 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFmRSPf_3io0ihCd3XNqkItLTfEQ9u0Uks5vbRzzgaJpZM4HP4Zb>
.
|
Fixes errors in the boundary conditions for univariate distributions.
These errors were partly caused because for some distributions
the boundary points of the support are not in the support itself (in other words, the support is an open or half-open interval).
I solved this by allowing the support to be a closed, half-open or open interval.
The default is a closed interval (which corresponds to the old behaviour), meaning that only
those distributions with a half-open or open support interval needed to be
amended. The core of this functionality is in the file src/univariates.jl.
For half-open or open intervals, the macro @distr_boundaries specifies
which boundaries of the interval are open or closed. See e.g., in
src/univariate/continuous/lognormal.jl for an example.
Tests have been added to test/utils.jl and src/testutils.jl
A second set of errors came from sloppy implementations in pdf, logpdf,
cdf, ccdf, logcdf or logccdf functions of several distributions that did not explicitly test if the
presented values fall inside or outside the support. Tests that I added
in src/testutils.jl flagged - I believe - all instances of such errors,
and the errors have been fixed.