-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design for bound, truncated and censored distributions #1864
Comments
Thanks for the proposal @aseyboldt, I like it. Some thoughts:
I agree that the current implementation is confusing to users and this would be nice API. Would you imagine that there would be a similar wrapper like Bound and the truncated distributions would just be created using that? E.g. in
Sounds good. If there is buy-in from other packages we might also want to consider breaking this out into a separate distributions package.
Purely for perfomance reasons? Because I think theano should be smart enough to pre-compute all constants. Of course this only works if the normalisation does not depend on the parameters. In that case, I wonder if there is some way to detect these terms, in that case we could write a theano graph optimizer that removes them. |
Currently only the log CDF is getting implemented in #1871. |
@domenzain I'm not opposed if there is general utility to having them available. I am just wary of adding (and then maintaining) methods that are rarely used except in highly specialized cases. |
Now that there are a bunch of CDF methods available through #2688, could we try a generic @junpenglao, I see you have worked a lot on |
@domenzain Yes, that sounds great! |
Something like |
This is an old one. Does the |
Yeah I think so |
There have been a couple of issues about these, and I thought it might help to collect them and try to find a good design for truncated and censored distributions.
#1843 #1847 #1833 #1861 #1860
This got a bit longer and less concrete than I wanted, but here it is:
Just to get everyone on the same page about the status quo:
pm.Bound
, which maps a distribution to a different distribution that can't take values outside some interval, but has the same logp in this interval. They aren't strictly speaking probability distributions anymore (the norm is less than 1).cdf / ccdf
We need
lcdf
and/orlccdf
functions for our distributions for both truncation and censoring. #1861 is a start for that. I would propose to add stubs todistribution.Distribution
that raise aNotImplementedError
and then overload those in the actual distributions. We might also want to add something likelinterval
and maybelcinterval
– sometimes there are more stable ways to compute the log of the difference of cdfs.Bound distributions
I'm actually not sure that we should keep those. It seems to me that they could be useful for defining new distributions, but in those cases an interval transform seems more straight forward. There are cases where we can use a bound distribution instead of a truncated one and get the same result, (if the difference of the lpdfs does not depend on any of the parameters), and we might save a bit of time in those cases. But I'm not sure this is worth the additional confusion (see #1833 for an example where it went wrong; and I did something like that in the past myself). To prevent the loss of performance, we could add a parameter
propto
toDistribution.logp
, and drop terms that do not depend on the parameters if it is set toTrue
to get the same effect. (stan does this, and it might save a bit of time in a lot of other circumstances, too).Are there other use-cases that I'm not aware of? If not, I think deprecating them in favour of truncated distributions and removing them in a later release would be fine. Instead, we could also only deprecate the logp function of
Bound
, I think that would prevent most misusage. A third option would be to change their behaviour to the truncated distribution.Truncated distributions
Implementing those seems rather strait forward once we have the lcdfs: Do the same as in Bound, but correct the logp function with the difference of the cdfs at the boundaries and fix the random function. We could follow the same design as
Bound
, but maybe we could use the opportunity to make that a bit more ergonomic. Bound seems a bit confusing to new users right now. Maybe it would be better to predefine distributionsTruncatedNormal
et al or put them in a submoduletruncated.Normal
. They could then have additional parameterslower_trunc
andupper_trunc
. This would also make it much easier to test. I doubt all conceivably useful combinations involvingBound
are working at the moment (eg #1847)Censored distributions
They are tricky. Supporting a couple of simple use cases seems easy enough: Say we analyse a survival experiment involving mice, where the experiment ends after 6 months. We could model that with something like that:
and code censored events by setting them to 6 exactly (which is a null set in the original distribution), or add an additional parameter
is_censored
.Complications:
tldr
Bound
distributions.tuncated
containing truncated (and tested) versions of most distributions. Each with two additional parameterslower_trunc
andupper_trunc
.propto
argument to logp of distribution and drop summands that do not depend on the parameters.The text was updated successfully, but these errors were encountered: