Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design for bound, truncated and censored distributions #1864

Closed
aseyboldt opened this issue Mar 6, 2017 · 8 comments
Closed

Design for bound, truncated and censored distributions #1864

aseyboldt opened this issue Mar 6, 2017 · 8 comments

Comments

@aseyboldt
Copy link
Member

There have been a couple of issues about these, and I thought it might help to collect them and try to find a good design for truncated and censored distributions.

#1843 #1847 #1833 #1861 #1860

This got a bit longer and less concrete than I wanted, but here it is:

Just to get everyone on the same page about the status quo:

  • "Bound" distributions: These are implemented as pm.Bound, which maps a distribution to a different distribution that can't take values outside some interval, but has the same logp in this interval. They aren't strictly speaking probability distributions anymore (the norm is less than 1).
  • Truncated distributions: Same as "Bound", but with a correction to the logp to make the norm 1. We do not have a build-in method of defining those (it's possible with a potential or a custom distribution).
  • Censored distributions: Values in a subset of the support are reported, values outside are only counted. We might want to support right/left/interval censoring. There is no current direct support for it.

cdf / ccdf

We need lcdf and/or lccdf functions for our distributions for both truncation and censoring. #1861 is a start for that. I would propose to add stubs to distribution.Distribution that raise a NotImplementedError and then overload those in the actual distributions. We might also want to add something like linterval and maybe lcinterval – sometimes there are more stable ways to compute the log of the difference of cdfs.

Bound distributions

I'm actually not sure that we should keep those. It seems to me that they could be useful for defining new distributions, but in those cases an interval transform seems more straight forward. There are cases where we can use a bound distribution instead of a truncated one and get the same result, (if the difference of the lpdfs does not depend on any of the parameters), and we might save a bit of time in those cases. But I'm not sure this is worth the additional confusion (see #1833 for an example where it went wrong; and I did something like that in the past myself). To prevent the loss of performance, we could add a parameter propto to Distribution.logp, and drop terms that do not depend on the parameters if it is set to True to get the same effect. (stan does this, and it might save a bit of time in a lot of other circumstances, too).
Are there other use-cases that I'm not aware of? If not, I think deprecating them in favour of truncated distributions and removing them in a later release would be fine. Instead, we could also only deprecate the logp function of Bound, I think that would prevent most misusage. A third option would be to change their behaviour to the truncated distribution.

Truncated distributions

Implementing those seems rather strait forward once we have the lcdfs: Do the same as in Bound, but correct the logp function with the difference of the cdfs at the boundaries and fix the random function. We could follow the same design as Bound, but maybe we could use the opportunity to make that a bit more ergonomic. Bound seems a bit confusing to new users right now. Maybe it would be better to predefine distributions TruncatedNormal et al or put them in a submodule truncated.Normal. They could then have additional parameters lower_trunc and upper_trunc. This would also make it much easier to test. I doubt all conceivably useful combinations involving Bound are working at the moment (eg #1847)

Censored distributions

They are tricky. Supporting a couple of simple use cases seems easy enough: Say we analyse a survival experiment involving mice, where the experiment ends after 6 months. We could model that with something like that:

Survival = pm.Censored(pm.Weibull, upper=6)
Survival('months', ..., observed=data)

and code censored events by setting them to 6 exactly (which is a null set in the original distribution), or add an additional parameter is_censored.

Complications:

  • We need to be careful to not break posterior sampling with something like that.
  • Censoring can be much more complicated if we allow the censoring events themselves to be variables. Say we want to model a cancer treatment, where no patients drop out, but from time to time a patient is hit by a stray meteorite – not related to the treatment at all (and sorry for the example, that's what happens if you spend time with biologists and astronomers). Then the time to a meteorite hit – a censoring event – follows an exponential distribution. But since some patients die before we observe that, we end up with a censored distribution for the meteor hits and the normal deaths. And in real experiments the censoring events might not even be independent from the actual event of interest.
  • There are different forms of censoring: right / left / interval and combinations thereof.
  • Censored distributions are not absolutely continuous. Maybe it helps to look at them as a mixture of a truncated distribution and a discrete distribution.

tldr

  • Deprecate Bound
  • add submodule distributions.tuncated containing truncated (and tested) versions of most distributions. Each with two additional parameters lower_trunc and upper_trunc.
  • Implement lcdf, lccdf, linterval, lcinterval for most distributions
  • Add propto argument to logp of distribution and drop summands that do not depend on the parameters.
  • Think about a good design for censored
@twiecki
Copy link
Member

twiecki commented Mar 6, 2017

Thanks for the proposal @aseyboldt, I like it. Some thoughts:

Deprecate Bound
This makes sense to me. It didn't even occur to me that the way we do it now results in non-proper probability distributions.

add submodule distributions.tuncated containing truncated (and tested) versions of most distributions. Each with two additional parameters lower_trunc and upper_trunc.

I agree that the current implementation is confusing to users and this would be nice API. Would you imagine that there would be a similar wrapper like Bound and the truncated distributions would just be created using that? E.g. in distributions.truncated:
Normal = lambda *args, **kwargs: Bound(pm.Normal, *args, **kwargs), lower_trunc=kwargs['lower_trunc'], ...)
We wouldn't do it like that but just to communicate the rough idea. Because in that case we could probably also just create the truncated distributions on the fly on model import.

Implement lcdf, lccdf, linterval, lcinterval for most distributions

Sounds good. If there is buy-in from other packages we might also want to consider breaking this out into a separate distributions package.

Add propto argument to logp of distribution and drop summands that do not depend on the parameters.

Purely for perfomance reasons? Because I think theano should be smart enough to pre-compute all constants. Of course this only works if the normalisation does not depend on the parameters. In that case, I wonder if there is some way to detect these terms, in that case we could write a theano graph optimizer that removes them.

@domenzain
Copy link
Contributor

Currently only the log CDF is getting implemented in #1871.
Should we also directly implement the log CCDF?
We might gain something in the numerical stability.

@fonnesbeck
Copy link
Member

@domenzain I'm not opposed if there is general utility to having them available. I am just wary of adding (and then maintaining) methods that are rarely used except in highly specialized cases.

@domenzain
Copy link
Contributor

Now that there are a bunch of CDF methods available through #2688, could we try a generic Truncated?
In the meantime TruncatedNormal has appeared too, so clearly there are users with a need for this.

@junpenglao, I see you have worked a lot on transformations.py. Would you say this makes sense as a transformation?

@twiecki
Copy link
Member

twiecki commented Oct 10, 2018

@domenzain Yes, that sounds great!

@drbenvincent
Copy link

Something like pm.Censored and pm.Truncated would be very useful.

@fonnesbeck
Copy link
Member

fonnesbeck commented Nov 18, 2021

This is an old one. Does the BoundRV stuff in v4 along with #5169 take care of enough of this to close it?

@ricardoV94
Copy link
Member

This is an old one. Does the BoundRV stuff in v4 along with #5169 take care of enough of this to close it?

Yeah I think so

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants