Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

censored() proposal #577

Closed
tomicapretto opened this issue Oct 19, 2022 · 8 comments · Fixed by #581
Closed

censored() proposal #577

tomicapretto opened this issue Oct 19, 2022 · 8 comments · Fixed by #581
Labels
Discussion Issue open for discussion, still not ready for a PR on it. enhancement

Comments

@tomicapretto
Copy link
Collaborator

We have #541 and #543 where models for censored data are discussed. Here I sketch the function we could use to handle censoring. It's inspired by Surv() and cens() from the survival and brms packages in R respectively.

import numpy as np
import pandas as pd

from formulae import design_matrices

Proposal 1

The function has three arguments. The first is the value of the variable, the second is the status ("left", "none", "right", and "interval"), and the third is an optional value that is passed when we use interval censoring.

def censored(values, status, right=None):
    STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}

    values = np.asarray(values)
    status = np.asarray(status)
    assert len(values) == len(status)
    if right is not None:
        right = np.asarray(right)
        assert len(values) == len(right)

    left_values = values
    right_values = right
    status = np.asarray([STATUS_MAPPING[s] for s in status])
    
    if right_values is not None:
        result = np.column_stack([left_values, right_values, status])
    else:
        result = np.column_stack([left_values, status])
    return result

# Will allow us to do our stuff within Bambi
censored.__metadata__ = {"kind": "censored"}

Dataset 1 Right censoring

rng = np.random.default_rng(1234)
size = 100
p = rng.beta(2, 20, size=size)
lifetime_true = rng.geometric(p)
censored_bool = lifetime_true > 35
observed_lifetime = [value if value <= 35 else 35 for value in lifetime_true]
status = ["right" if value else "none" for value in censored_bool]
data = pd.DataFrame({"lifetime": observed_lifetime, "status": status})
print(data.head())
   lifetime status
0         4   none
1         3   none
2        35  right
3         3   none
4        35  right

Then we can use it as

dm = design_matrices("censored(lifetime, status) ~ 1", data)
print(dm.response)
print(np.asarray(dm.response)[:10])
ResponseMatrix  
  name: censored(lifetime, status)
  kind: numeric
  shape: (100, 2)

To access the actual design matrix do 'np.array(this_obj)'
[[ 4  0]
 [ 3  0]
 [35  1]
 [ 3  0]
 [35  1]
 [15  0]
 [21  0]
 [28  0]
 [ 1  0]
 [ 3  0]]

One "drawback" of this approach appears when we consider interval censoring.

Dataset 1 Interval censoring

We know the value is within an interval, but we don't know the exact value

rng = np.random.default_rng(1234)
size = 100
p = rng.beta(2, 20, size=size)
lifetime_true = rng.geometric(p)
censored_bool = np.logical_and(lifetime_true >= 10, lifetime_true <= 20)
observed_lifetime = [10 if value >= 10 and value <= 20 else value for value in lifetime_true]
status = ["interval" if value else "none" for value in censored_bool]
data2 = pd.DataFrame({"lower": observed_lifetime, "upper": 20, "status": status})
print(data2.head())
   lower  upper status
0      4     20   none
1      3     20   none
2     90     20   none
3      3     20   none
4    103     20   none
print(data2[data2["status"] == "interval"][:5])
    lower  upper    status
5      10     20  interval
20     10     20  interval
26     10     20  interval
35     10     20  interval
37     10     20  interval

Here the call would look like

dm = design_matrices("censored(lower, status, upper) ~ 1", data2)
print(dm.response)
print(np.array(dm.response))
ResponseMatrix  
  name: censored(lower, status, upper)
  kind: numeric
  shape: (100, 3)

To access the actual design matrix do 'np.array(this_obj)'
[[  4  20   0]
 [  3  20   0]
 [ 90  20   0]
 [  3  20   0]
 [103  20   0]]

It works well, but what makes it not very appealing to me is that we have "value", "status", "value" in the signature.

But I have another proposal, that I'm calling censored2() for now. If this becomes the chosen one, of course it will be named censored().

def censored2(*args):
    STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}

    if len(args) == 2:
        left, status = args
        right = None
    elif len(args) == 3:
        left, right, status = args
    else:
        raise
    
    assert len(left) == len(status)

    if right is not None:
        right = np.asarray(right)
        assert len(left) == len(right)

    status = np.asarray([STATUS_MAPPING[s] for s in status])
    
    if right is not None:
        result = np.column_stack([left, right, status])
    else:
        result = np.column_stack([left, status])
    
    return result

Notice the only argument is an unnamed argument of variable length. Internally, we check it's of length 2 or 3.

This allows us to do

design_matrices("censored2(lower, upper, status) ~ 1", data2).response
ResponseMatrix  
  name: censored2(lower, upper, status)
  kind: numeric
  shape: (100, 3)

and

design_matrices("censored2(lifetime, status) ~ 1", data).response
ResponseMatrix  
  name: censored2(lifetime, status)
  kind: numeric
  shape: (100, 2)

which reads much better to me.


In summary, we have two candidate implementations for censored(). They both do the same, but they differ in the signature. One signature has 3 arguments well defined. The first is a value, the second a status, and the third is an optional value. The other signature has a single argument, which is an unnamed argument of variable length. Internally, we handle it differently depending on how many arguments we got. This allows the code to look like (value, status) and (value, value2, status) instead of (value, status, value2).


Note: This could be simplified a lot if we decide to support only left and right censoring. At the moment, PyMC supports only those out of the box. Interval censoring still requires more work on our end. I think it's still worth considering interval censoring from the very beginning, because it may be supported at some point.

@tomicapretto
Copy link
Collaborator Author

Tagging people who showed interest in the topic @don-jil, @ipa. Also tagging you @aloctavodia just in case you want to add something.

@tomicapretto tomicapretto added enhancement Discussion Issue open for discussion, still not ready for a PR on it. labels Oct 19, 2022
@zwelitunyiswa
Copy link

zwelitunyiswa commented Oct 19, 2022 via email

@tomicapretto
Copy link
Collaborator Author

censored2 seems the cleanest to me. Reminds me of the brms implementation.

On Wed, Oct 19, 2022 at 2:22 PM Tomás Capretto @.> wrote: Tagging people who showed interest in the topic @don-jil https://github.com/don-jil, @ipa https://github.com/ipa. Also tagging you @aloctavodia https://github.com/aloctavodia just in case you want to add something. — Reply to this email directly, view it on GitHub <#577 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH3QQV6RF4UZU7FGNFM4DQDWEA36BANCNFSM6AAAAAARJMLTO4 . You are receiving this because you are subscribed to this thread.Message ID: @.>

In brms it's y | cens(status, y2) ~ 1 if you want to have two values I think

@don-jil
Copy link

don-jil commented Oct 19, 2022

brms is what I'm familiar with, so that works for me. It will also help other folks who are migrating from R/brms.

@tomicapretto
Copy link
Collaborator Author

brms is what I'm familiar with, so that works for me. It will also help other folks who are migrating from R/brms.

The problem with using y | cens() is that the | operator has a very specific meaning in model formula language (i.e. group-specific effects). If I wanted to give it another meaning I would have to monkey-patch the implementation for this (and potentially others in the future) use cases and I don't think that's a good solution. Maintaining that will become a bigger problem later.

@ipa
Copy link

ipa commented Oct 19, 2022

To me the proposal for censored2 also looks cleanest. I agree with @tomicapretto with the issue on the | operator and would try to avoid that.

Cox proportional hazard models in R (which most are very familiar with) use Surv(lifetime, status) ~ 1 or Surv(lower, upper, status) ~ 1 for intervals or interval censoring, which would be equivalent to the censored2 proposal.

@don-jil
Copy link

don-jil commented Oct 19, 2022

That seems intuitive as well. :-)

@tomicapretto
Copy link
Collaborator Author

Thanks a lot @ipa, @don-jil, and @zwelitunyiswa for your input!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Discussion Issue open for discussion, still not ready for a PR on it. enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants