`censored()` proposal #577

tomicapretto · 2022-10-19T18:21:21Z

We have #541 and #543 where models for censored data are discussed. Here I sketch the function we could use to handle censoring. It's inspired by Surv() and cens() from the survival and brms packages in R respectively.

import numpy as np
import pandas as pd

from formulae import design_matrices

Proposal 1

The function has three arguments. The first is the value of the variable, the second is the status ("left", "none", "right", and "interval"), and the third is an optional value that is passed when we use interval censoring.

def censored(values, status, right=None):
    STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}

    values = np.asarray(values)
    status = np.asarray(status)
    assert len(values) == len(status)
    if right is not None:
        right = np.asarray(right)
        assert len(values) == len(right)

    left_values = values
    right_values = right
    status = np.asarray([STATUS_MAPPING[s] for s in status])
    
    if right_values is not None:
        result = np.column_stack([left_values, right_values, status])
    else:
        result = np.column_stack([left_values, status])
    return result

# Will allow us to do our stuff within Bambi
censored.__metadata__ = {"kind": "censored"}

Dataset 1 Right censoring

rng = np.random.default_rng(1234)
size = 100
p = rng.beta(2, 20, size=size)
lifetime_true = rng.geometric(p)
censored_bool = lifetime_true > 35
observed_lifetime = [value if value <= 35 else 35 for value in lifetime_true]
status = ["right" if value else "none" for value in censored_bool]
data = pd.DataFrame({"lifetime": observed_lifetime, "status": status})
print(data.head())

   lifetime status
0         4   none
1         3   none
2        35  right
3         3   none
4        35  right

Then we can use it as

dm = design_matrices("censored(lifetime, status) ~ 1", data)
print(dm.response)
print(np.asarray(dm.response)[:10])

ResponseMatrix  
  name: censored(lifetime, status)
  kind: numeric
  shape: (100, 2)

To access the actual design matrix do 'np.array(this_obj)'
[[ 4  0]
 [ 3  0]
 [35  1]
 [ 3  0]
 [35  1]
 [15  0]
 [21  0]
 [28  0]
 [ 1  0]
 [ 3  0]]

One "drawback" of this approach appears when we consider interval censoring.

Dataset 1 Interval censoring

We know the value is within an interval, but we don't know the exact value

rng = np.random.default_rng(1234)
size = 100
p = rng.beta(2, 20, size=size)
lifetime_true = rng.geometric(p)
censored_bool = np.logical_and(lifetime_true >= 10, lifetime_true <= 20)
observed_lifetime = [10 if value >= 10 and value <= 20 else value for value in lifetime_true]
status = ["interval" if value else "none" for value in censored_bool]
data2 = pd.DataFrame({"lower": observed_lifetime, "upper": 20, "status": status})
print(data2.head())

   lower  upper status
0      4     20   none
1      3     20   none
2     90     20   none
3      3     20   none
4    103     20   none

print(data2[data2["status"] == "interval"][:5])

    lower  upper    status
5      10     20  interval
20     10     20  interval
26     10     20  interval
35     10     20  interval
37     10     20  interval

Here the call would look like

dm = design_matrices("censored(lower, status, upper) ~ 1", data2)
print(dm.response)
print(np.array(dm.response))

ResponseMatrix  
  name: censored(lower, status, upper)
  kind: numeric
  shape: (100, 3)

To access the actual design matrix do 'np.array(this_obj)'
[[  4  20   0]
 [  3  20   0]
 [ 90  20   0]
 [  3  20   0]
 [103  20   0]]

It works well, but what makes it not very appealing to me is that we have "value", "status", "value" in the signature.

But I have another proposal, that I'm calling censored2() for now. If this becomes the chosen one, of course it will be named censored().

def censored2(*args):
    STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}

    if len(args) == 2:
        left, status = args
        right = None
    elif len(args) == 3:
        left, right, status = args
    else:
        raise
    
    assert len(left) == len(status)

    if right is not None:
        right = np.asarray(right)
        assert len(left) == len(right)

    status = np.asarray([STATUS_MAPPING[s] for s in status])
    
    if right is not None:
        result = np.column_stack([left, right, status])
    else:
        result = np.column_stack([left, status])
    
    return result

Notice the only argument is an unnamed argument of variable length. Internally, we check it's of length 2 or 3.

This allows us to do

design_matrices("censored2(lower, upper, status) ~ 1", data2).response

ResponseMatrix  
  name: censored2(lower, upper, status)
  kind: numeric
  shape: (100, 3)

and

design_matrices("censored2(lifetime, status) ~ 1", data).response

ResponseMatrix  
  name: censored2(lifetime, status)
  kind: numeric
  shape: (100, 2)

which reads much better to me.

In summary, we have two candidate implementations for censored(). They both do the same, but they differ in the signature. One signature has 3 arguments well defined. The first is a value, the second a status, and the third is an optional value. The other signature has a single argument, which is an unnamed argument of variable length. Internally, we handle it differently depending on how many arguments we got. This allows the code to look like (value, status) and (value, value2, status) instead of (value, status, value2).

Note: This could be simplified a lot if we decide to support only left and right censoring. At the moment, PyMC supports only those out of the box. Interval censoring still requires more work on our end. I think it's still worth considering interval censoring from the very beginning, because it may be supported at some point.

The text was updated successfully, but these errors were encountered:

tomicapretto · 2022-10-19T18:22:11Z

Tagging people who showed interest in the topic @don-jil, @ipa. Also tagging you @aloctavodia just in case you want to add something.

zwelitunyiswa · 2022-10-19T18:25:00Z

censored2 seems the cleanest to me. Reminds me of the brms implementation.

…

On Wed, Oct 19, 2022 at 2:22 PM Tomás Capretto ***@***.***> wrote: Tagging people who showed interest in the topic @don-jil <https://github.com/don-jil>, @ipa <https://github.com/ipa>. Also tagging you @aloctavodia <https://github.com/aloctavodia> just in case you want to add something. — Reply to this email directly, view it on GitHub <#577 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AH3QQV6RF4UZU7FGNFM4DQDWEA36BANCNFSM6AAAAAARJMLTO4> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

tomicapretto · 2022-10-19T18:31:29Z

censored2 seems the cleanest to me. Reminds me of the brms implementation.
…
On Wed, Oct 19, 2022 at 2:22 PM Tomás Capretto @.> wrote: Tagging people who showed interest in the topic @don-jil https://github.com/don-jil, @ipa https://github.com/ipa. Also tagging you @aloctavodia https://github.com/aloctavodia just in case you want to add something. — Reply to this email directly, view it on GitHub <#577 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH3QQV6RF4UZU7FGNFM4DQDWEA36BANCNFSM6AAAAAARJMLTO4 . You are receiving this because you are subscribed to this thread.Message ID: @.>

In brms it's y | cens(status, y2) ~ 1 if you want to have two values I think

don-jil · 2022-10-19T20:18:35Z

brms is what I'm familiar with, so that works for me. It will also help other folks who are migrating from R/brms.

tomicapretto · 2022-10-19T20:54:47Z

brms is what I'm familiar with, so that works for me. It will also help other folks who are migrating from R/brms.

The problem with using y | cens() is that the | operator has a very specific meaning in model formula language (i.e. group-specific effects). If I wanted to give it another meaning I would have to monkey-patch the implementation for this (and potentially others in the future) use cases and I don't think that's a good solution. Maintaining that will become a bigger problem later.

ipa · 2022-10-19T21:10:49Z

To me the proposal for censored2 also looks cleanest. I agree with @tomicapretto with the issue on the | operator and would try to avoid that.

Cox proportional hazard models in R (which most are very familiar with) use Surv(lifetime, status) ~ 1 or Surv(lower, upper, status) ~ 1 for intervals or interval censoring, which would be equivalent to the censored2 proposal.

don-jil · 2022-10-19T23:07:26Z

That seems intuitive as well. :-)

tomicapretto · 2022-10-19T23:49:49Z

Thanks a lot @ipa, @don-jil, and @zwelitunyiswa for your input!

tomicapretto added enhancement Discussion Issue open for discussion, still not ready for a PR on it. labels Oct 19, 2022

tomicapretto mentioned this issue Oct 22, 2022

Implement censored() #581

Merged

tomicapretto closed this as completed in #581 Oct 22, 2022

tomicapretto mentioned this issue Aug 8, 2023

Censored Data in Bambi #541

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`censored()` proposal #577

`censored()` proposal #577

tomicapretto commented Oct 19, 2022

tomicapretto commented Oct 19, 2022

zwelitunyiswa commented Oct 19, 2022 via email

tomicapretto commented Oct 19, 2022

don-jil commented Oct 19, 2022

tomicapretto commented Oct 19, 2022

ipa commented Oct 19, 2022

don-jil commented Oct 19, 2022

tomicapretto commented Oct 19, 2022

censored() proposal #577

censored() proposal #577

Comments

tomicapretto commented Oct 19, 2022

tomicapretto commented Oct 19, 2022

zwelitunyiswa commented Oct 19, 2022 via email

tomicapretto commented Oct 19, 2022

don-jil commented Oct 19, 2022

tomicapretto commented Oct 19, 2022

ipa commented Oct 19, 2022

don-jil commented Oct 19, 2022

tomicapretto commented Oct 19, 2022

`censored()` proposal #577

`censored()` proposal #577