Kdm/univariateboundaryfix #457

KrisDM · 2016-01-30T20:09:21Z

Fixes errors in the boundary conditions for univariate distributions.

These errors were partly caused because for some distributions
the boundary points of the support are not in the support itself (in other words, the support is an open or half-open interval).

I solved this by allowing the support to be a closed, half-open or open interval.
The default is a closed interval (which corresponds to the old behaviour), meaning that only
those distributions with a half-open or open support interval needed to be
amended. The core of this functionality is in the file src/univariates.jl.
For half-open or open intervals, the macro @distr_boundaries specifies
which boundaries of the interval are open or closed. See e.g., in
src/univariate/continuous/lognormal.jl for an example.

Tests have been added to test/utils.jl and src/testutils.jl

A second set of errors came from sloppy implementations in pdf, logpdf,
cdf, ccdf, logcdf or logccdf functions of several distributions that did not explicitly test if the
presented values fall inside or outside the support. Tests that I added
in src/testutils.jl flagged - I believe - all instances of such errors,
and the errors have been fixed.

andreasnoack · 2016-01-30T20:23:37Z

Thanks for doing this systematically. There are quite a few changes here so it might take some time to reivew. I'll try to do it on Monday.

KrisDM · 2016-01-30T20:31:55Z

Great, thanks.

Btw, it somehow seems to have lumped the commits I did for PR #455 into this PR, I don't know how that happened as I created a new branch and a new PR for this. But there are no conflicts: this work fixes errors in univariates, whereas PR #455 is a new multivariate distribution.

andreasnoack · 2016-02-01T17:31:50Z

@KrisDM Please remove the commits that are not related to this PR. I know it can be useful to work on top of your other branches, but it makes the review harder.

KrisDM · 2016-02-01T17:41:05Z

How do I do that? I have the impression it automatically merged the first PR once I made the second PR. It's not something I did knowingly (except for the last commit, which I did in this branch because I thought they had been merged now).

andreasnoack · 2016-02-01T18:04:00Z

My guess is that you started out from the branch that you used for preparing the other PRs.

There is probably several ways of fixing this, but I would first copy the changes to a different location. Then I'd reset this branch to origin/master, i.e.

git fetch origin
git reset --hard origin/master

and then add the relevant chages to the branch and then

git push --force fork kdm/univariateboundaryfix

when you are done. If someone else has a better solution, feel free to suggest it.

KrisDM · 2016-02-01T18:22:30Z

OK, thanks for the tips, will get to it later tonight.

KrisDM · 2016-02-02T11:34:26Z

@andreasnoack I cleaned up the commits, squashing only the univariate-related commits into one so if you don't see any other issues with my fix you can merge the PR straight away.

I ended up doing it slightly differently from your suggestions, using git rebase -i origin/master to separate the commits, and then do a git push --force to my branch.

You were right on the cause: I now know that my original error was to not start the new branch from master, but from my previously created branch where I had done the multivariate work.

KrisDM · 2016-02-02T12:13:43Z

Closed-opened PR to trigger new Travis build (which had failed for an external reason).

andreasnoack · 2016-02-04T01:24:14Z

@KrisDM I'm wondering if we can avoid the extra complexity you add here by using a few floting point tricks. It would be useful if you could give specific examples of the issues you are seeing, but a guess could be that you are e.g. trying to avoid the NaN in logpdf(Chi(1), 0.0). If this case, I think it is sufficient to have something like

function logpdf(d::Chi, x::Float64)
    ν = d.ν
    if insupport(d, x)
        return (1.0 - 0.5 * ν) * logtwo + max((ν - 1.0) * log(x), -Inf) - 0.5 * x^2 - lgamma(0.5 * ν)
    else
        return -Inf
    end
end

KrisDM · 2016-02-04T07:48:15Z

@andreasnoack The issues were somewhat different for all distributions that I made changes to. Sometimes it was about avoiding a NaN on the boundary point of the support, as you say, but it doesn't cover all cases. I came across some domain errors been thrown as well.

I thought that this was a principled solution because in the mathematical formulation of distributions, some support intervals are closed, some are open, and some are half-open. In the Julia code, however, all support intervals were regarded as closed.

I don't think this adds unnecessary complexity. In terms of computations, probably none, as it's likely the compiler inlines everything away. Developers of the distributions package who are adding a new distribution with half-open or open support interval only need to define 1 macro that follows the same pattern as the existing @distr_support macro, and then not worry about what happens at the boundary points. And finally, the users of the Distributions package are entirely shielded.

In contrast, the fix you suggest would have to be applied separately to each affected function, and would be different for each of them. It adds at least one and possibly more operations per function (functions that may need to be called often) and makes the code less readable. In my experience, a fix like that always comes back to bite in the future, so I'm not a fan.

…ions. These errors were partly caused because for some distributions the boundary points of the support are not in the support itself. I solved this by allowing the support to be a closed, half-open or open interval. The default is a closed interval (which corresponds to the old behaviour), meaning that only those distributions with a half-open or open support interval needed to be ammended. The core of this functionality is in the file src/univariates.jl. For half-open or open intervals, the macro @distr_boundaries specifies which boundaries of the interval are open or closed. See e.g., in src/univariate/continuous/lognormal.jl for an example. Tests have been added to test/utils.jl and src/testutils.jl A second set of errors came from sloppy implementations in pdf, logpdf, cdf, ccdf, logcdf or logccdf functions of specific distributions that did not explicitly test if the presented values fall inside or outside the support. Tests that were added in src/testutils.jl flagged - I believe - all instances of such errors, and the errors have been fixed. I spotted 2 additional errors in the support() function in src/univariates.jl. There were no tests for this function so I added them to src/testutils.jl/test_support()

KrisDM · 2016-02-05T19:32:10Z

@andreasnoack I fixed the indentation which was off in two places.

By chance I met the developer of Juno yesterday. I asked him to change the default to 4 spaces per tab :-).

andreasnoack · 2016-02-06T03:09:20Z

I'm still not sold on this change. I think this should just be handled for each definition of the special functions instead of introducing new types, functions and a macro. This package is already quite heavy in terms of compilation and I don't think it makes the code simpler to understand.

If the special functions don't return NaNs or throw then you don't have to care in other places if the interval includes the boundary point or not. It might take some work to go through, but it would be much more work than this PR.

I wouldn't worry about the max. It just a single comparison in a function that calls lgamma and log so it will be almost impossible to measure a difference.

Finally, there are some issues with the implementation here which will probably result in a serious slow down. The reason is that the comparison function ends up being determined by a runtime value. One day the compiler might be smart enough to figure out that the support symbols are constant for a specific distribution, but right now you'll see things like

julia> @code_warntype insupport(Chisq(1), 2.0)
Variables:
  d::Distributions.Chisq
  x::Float64

Body:
  begin  # /Users/andreasnoack/.julia/v0.4/Distributions/src/univariates.jl, line 52:
      unless ((Distributions.comparator)((top(apply_type))(Distributions.Val,:closed)::Type{_<:Val{T}})::F)(0.0,x::Float64)::Any goto 0
      return ((Distributions.comparator)((top(apply_type))(Distributions.Val,:closed)::Type{_<:Val{T}})::F)(x::Float64,Distributions.Inf)::Any
      0: 
      return false
  end::Any

This is one of the reasons why I was talking about complexity.

KrisDM · 2016-02-06T13:59:24Z

@andreasnoack Fair point that the implementation has issues. I thought the compiler would be smart enough to figure it out. I'll see if I can fix it.

In general I think that, because the root cause of the problem lies with the insupport function, it needs to be fixed inside the insupport function, and not with a custom hack to each of the calculations, which is error-prone and less maintainable.

Another point: a large part of the complexity of my fix is because I wanted that the fix would be equally valid to the auxiliary RealInterval type, which isn't used anywhere inside the Distributions package. I don't see why Distributions is actually exporting this type at all. I checked a handful of downstream packages using Distributions, and have not seen any indication anyone is using it, probably because it was never documented (neither the RealInterval type, nor the function support() which returns a RealInterval, are mentioned in the documentation). So if I am allowed to retire RealInterval and support(), we can get rid of some superfluous complexity in the package, and I'll be able to fix the insupport function far more easily with a uniform fix for all distributions.

What do you think?

simonbyrne · 2016-02-08T11:32:47Z

Sorry for getting to this so late, I've had other stuff on my plate.

Thanks for looking into this, but I have to agree with @andreasnoack, I think that this seems a somewhat over-complicated solution to the problem. The whole open-closed distinction is somewhat meaningless in regards to distributions, since the only cases where it can possibly be different is at +/-Inf (and even then, I don't really see any reason why Inf couldn't be in the support anyway).

KrisDM · 2016-02-08T12:58:55Z

@simonbyrne Mmh, that's not true though. Quite a few distributions have a limited interval. E.g., the LogNormal has open support interval (0,+Inf).

If I call insupport(LogNormal(),0.0), I would expect it to return false, not true as it does now.

There are several more. Another example, the Levy distribution, has open support interval (mu,+Inf), with mu a parameter of the distribution.

As I said in my previous message, the solution would become a lot simpler if I am allowed to retire the undocumented but exported type RealInterval (defined at the top of univariates.jl, but used nowhere in the package). There is another Julia package implementing intervals, and it seems to me that the Distributions package shouldn't be exporting its own.

simonbyrne · 2016-02-08T13:02:48Z

See earlier discussion in #150 (comment)

simonbyrne · 2016-02-08T13:04:00Z

But I would agree that RealInterval isn't much use as it stands.

KrisDM · 2016-02-08T14:11:16Z

#150 seems to be a somewhat different discussion.

Perhaps "support" has subtly different meanings in different branches of mathematics. Each Wikipedia page about a particular distribution defines the support as an open, half-open or closed interval, so it seems to be an accepted concept in mainstream probability theory.

These wiki pages about distributions link to another wiki page about support than the one you posted in #150: https://en.wikipedia.org/wiki/Support_(mathematics)

If I understand that page well, then, unlike what you write in #150, there is a difference between uniform distributions with support (0.0,1.0) and [0.0,1.0], as the first one should return pdf(0.0) = 0.0, and the second one would return a value > 0.0.

I will propose another solution that retires RealInterval and fixes the insupport function for specific distributions whose support has open/half-open intervals (as per their wiki definition). You can then see whether you like the solution or not. If you don't like that solution then so be it, but I have no interest in hacking each pdf, logpdf, cdf, ccdf etc function separately to get rid of the incorrect results. That's not right either.

simonbyrne · 2016-02-08T14:48:05Z

These wiki pages about distributions link to another wiki page about support than the one you posted in #150: https://en.wikipedia.org/wiki/Support_(mathematics)

If I understand that page well, then, unlike what you write in #150, there is a difference between uniform distributions with support (0.0,1.0) and [0.0,1.0], as the first one should return pdf(0.0) = 0.0, and the second one would return a value > 0.0.

I don't see how you get that: the page you link to refers to support of a function, not a distribution (it does mention distributions and measures further down, but points out that there are subtleties involved).

KrisDM · 2016-02-08T15:26:07Z

I agree it's not clear cut, and that sometimes support is indeed referred to as a closure.

But then, there seems to be a need to make a distinction between those distributions for which the boundary points return 0.0 for their pdf, and those who don't. Otherwise surely the support interval for the LogNormal wouldn't be defined as an open interval, as here: https://en.wikipedia.org/wiki/Log-normal_distribution (right-hand column)

It may be that those pages are actually wrong, or otherwise it's one of the subtleties that the support page glosses over?

simonbyrne · 2016-02-08T15:44:34Z

I hesitate to say it, but I think the pages might be wrong. They seem to be treating as "the support of the pdf", but pdfs themselves are only uniquely defined up to null sets. In other words, we can set the value of the pdf arbitrarily at the boundary points, and not change the underlying distribution.

There are also more practical reasons for treating the intervals as closed. In the uniform (0,1) case, we would have x = 1e-20 inside the support, but 1-x outside the support.

KrisDM · 2016-02-08T16:17:39Z

You may be right. On further reading, it seems there is a mixup of "support of the distribution" (the set of values for which PR(x=X) > 0) and the "domain of the pdf" (the values for which the function is defined). For some distributions these are the same, and for others they are not.

That mixup is probably what gives rise to the confusion on the wiki pages, but it's also visible in the Julia code itself, whenever the insupport function is used inside the pdf calculation to determine what actually should be the domain. E.g., for the LogNormal, the support is [0,+Inf], whereas the domain is x > 0.0.

So to be correct, we actually should not rely on the insupport function inside the calculation of the pdf, but do domain boundary checks which may or may not be different from what the insupport function returns.

KrisDM · 2016-02-08T16:20:38Z

In other words, it seems that pdf(LogNormal(),0.0) or pdf(LogNormal(),-1.0) should not return 0.0, but throw a DomainError?

simonbyrne · 2016-02-08T16:35:40Z

Not quite, the support isn't P(X=x) > 0 (otherwise continuous distributions would have no support).

I think the general consensus we have reached is that we should avoid DomainErrors if possible. i.e. return pdf(d,x) == 0 for x outside the support. In the lognormal case, it would also make sense to have pdf(d,0) == 0, as that makes the pdf continuous, but for this may not be the case for all distributions (e.g. Exponential).

KrisDM · 2016-02-09T13:10:09Z

You're right, I meant pdf(x) > 0.0. I'll prep a fix when I have time. It seems the most valuable thing I did in the last fix was write the tests that expose the boundary errors.

In the fix I propose not relying on insupport inside the calculations for which support of the distribution and the domain of its pdf are different (the first one always being a closed interval if we follow the strict definition of support as closure, the second one not necessarily) Performing the correct boundary tests would be better than fiddling with the calculations.

matbesancon · 2018-09-19T15:01:38Z

what's the status on this one? It needs rebasing from master plus the fixes you mentioned in your last comment @KrisDM

matbesancon · 2019-03-28T19:50:09Z

ping @KrisDM :)

KrisDM · 2019-03-29T09:07:12Z

I'm very sorry but I don't have the time at the moment to look at this. Is there anything I need to do to "unblock" the issue, i.e., remove a commit or pull request or anything else? I'm not sure anymore where I left this two years ago. Best, Kris *---* *How we can become convinced we're right, even when we're wrong.* *Web*: http://www.rightbetween.com *Facebook*: http://facebook.com/rightbtw *Twitter*: http://twitter.com/@rightbtw --- Aniku Ltd, London, United Kingdom Registered in England and Wales, No: 7213769

…

On Thu, 28 Mar 2019 at 19:50, Mathieu Besançon ***@***.***> wrote: ping @KrisDM <https://github.com/KrisDM> :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#457 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFmRSPf_3io0ihCd3XNqkItLTfEQ9u0Uks5vbRzzgaJpZM4HP4Zb> .

KrisDM mentioned this pull request Jan 30, 2016

LogNormal inconsistent when outside support #456

Closed

KrisDM force-pushed the kdm/univariateboundaryfix branch from e4ee8b0 to 834f910 Compare February 2, 2016 11:27

KrisDM closed this Feb 2, 2016

KrisDM reopened this Feb 2, 2016

KrisDM force-pushed the kdm/univariateboundaryfix branch from 834f910 to 7ce0216 Compare February 5, 2016 19:29

ludkinm mentioned this pull request Sep 26, 2017

Fixed NaN for Multinomial degenerate cases + tests. #670

Merged

aplavin mentioned this pull request Sep 1, 2024

support intervals #1809

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kdm/univariateboundaryfix #457

Kdm/univariateboundaryfix #457

KrisDM commented Jan 30, 2016

andreasnoack commented Jan 30, 2016

KrisDM commented Jan 30, 2016

andreasnoack commented Feb 1, 2016

KrisDM commented Feb 1, 2016

andreasnoack commented Feb 1, 2016

KrisDM commented Feb 1, 2016

KrisDM commented Feb 2, 2016

KrisDM commented Feb 2, 2016

andreasnoack commented Feb 4, 2016

KrisDM commented Feb 4, 2016

KrisDM commented Feb 5, 2016

andreasnoack commented Feb 6, 2016

KrisDM commented Feb 6, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 8, 2016

KrisDM commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 9, 2016

matbesancon commented Sep 19, 2018

matbesancon commented Mar 28, 2019

KrisDM commented Mar 29, 2019 via email

Kdm/univariateboundaryfix #457

Are you sure you want to change the base?

Kdm/univariateboundaryfix #457

Conversation

KrisDM commented Jan 30, 2016

andreasnoack commented Jan 30, 2016

KrisDM commented Jan 30, 2016

andreasnoack commented Feb 1, 2016

KrisDM commented Feb 1, 2016

andreasnoack commented Feb 1, 2016

KrisDM commented Feb 1, 2016

KrisDM commented Feb 2, 2016

KrisDM commented Feb 2, 2016

andreasnoack commented Feb 4, 2016

KrisDM commented Feb 4, 2016

KrisDM commented Feb 5, 2016

andreasnoack commented Feb 6, 2016

KrisDM commented Feb 6, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 8, 2016

KrisDM commented Feb 8, 2016

simonbyrne commented Feb 8, 2016

KrisDM commented Feb 9, 2016

matbesancon commented Sep 19, 2018

matbesancon commented Mar 28, 2019

KrisDM commented Mar 29, 2019 via email