Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for survival analysis #543

Closed
wants to merge 8 commits into from
Closed

Conversation

ipa
Copy link

@ipa ipa commented Jul 6, 2022

This PR adds support for survival analysis using the Exponential distribution. It adds support for the exponential family and for censored data.

There is a notebook with examples using R's veteran dataset.

Issue: #541
Depends on: bambinos/formulae/issues/79

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@canyon289
Copy link
Collaborator

I want to say this is one of the best PRs ive ever seen in open source. Between the well formatted and readable code, tests, and docs I dont even know what more to ask for. Thanks so much for creating this

@tomicapretto
Copy link
Collaborator

@ipa thanks a lot for this fantastic PR!

As I mentioned in the formulae PR it's not needed to add a new transformation in there (unless we need to treat involved variables in a very special way). You can just add a surv function into the extra_namespace dictionary that's defined here

extra_namespace = {"c": c}

that namespace is made available when formulas are parsed:

self._design = design_matrices(formula, data, na_action, 1, extra_namespace)

I'm adding more comments as I review the code :)

@@ -0,0 +1,1587 @@
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add something telling how Surv works in the model formula.


Reply via ReviewNB

@@ -0,0 +1,1587 @@
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #1.    lam = 1/np.exp(az.summary(idata)["mean"]["Intercept"])

az.summary() is being called multiple times. We could call it once and store the result in a variable such as idata_summary


Reply via ReviewNB

@@ -0,0 +1,1587 @@
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #6.    S0 = sp.stats.expon.sf

Could you add a little about the sf function? From SciPy docs it says it's the survival function.


Reply via ReviewNB

@@ -0,0 +1,1587 @@
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #2.        "Surv(time,status) ~ 1 + B(trt_str, success='Test') + T(celltype_str, ref='Squamous')",

I really like you're using B() and T() in the model formula. Again, one sentence saying what they do would be great. I could help here if you want since I was the author of these shortcuts.


Reply via ReviewNB

@@ -0,0 +1,1587 @@
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fantastic chart. Looks very nice! Do you think we could think about a way to make it work with plot_cap? https://github.com//pull/517

That implies some modifications to the predict function. I'm not 100% sure if this is possible or not. But it's worth considering it since it would make things much easier.


Reply via ReviewNB

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check this out and see if it's possible.

@tomicapretto
Copy link
Collaborator

I've been reading a little about the documentation of the Surv function in the survival R package. By default, it results in right-censored data.

> library(survival)
> x <- Surv(veteran$time, veteran$status)
> attributes(x)
$dim
[1] 137   2

$dimnames
$dimnames[[1]]
NULL

$dimnames[[2]]
[1] "time"   "status"


$type
[1] "right"

$class
[1] "Surv"

As implemented in this PR, using the exponential family will always try to result in a survival model, but this shouldn't be the case. It should result in a survival model only when Surv is in the LHS of the model formula (I prefer lowercase surv or even other alternatives such as censored(y, status)). This requires more thinking of course.

@ipa
Copy link
Author

ipa commented Jul 9, 2022

@tomicapretto thanks for the great review and comments.

I was not aware of the extra_namespace but it makes perfect sense to add it there. I'll change that and also implement that a survival model is only results when Surv is in the LHS. We can use censored as you suggested to keep a more general terminology.

I will also work on the example notebook and add some extra information.

@tomicapretto
Copy link
Collaborator

@ipa feel free to ping me if you want help with something or have a question :)

@ipa
Copy link
Author

ipa commented Jul 11, 2022

@tomicapretto An easy but not really nice way to implement this is to check for if "censored" in self.term.name: while building the response or something similar.

After thinking about this a bit more I think it would be cleaner if we implement it similar to Proportion in the formulae package (introducing a Censored type). This would implement it again in the formulae package though. However, we could then use something like this while building the response:

if isinstance(self.term.data, Censored):
    __build_censored_response_distribution(...)

I tried to do this via extra_namespace but then the formulae package throws errors because the type is neither numeric nor any of the other supported types like Proportion, CategoricalBox etc.

@donaldbraman
Copy link

Just want to say how excited I am to see such rapid progress on this -- it will be a great addition, and will simplify a lot of work for folks (like me!) running survival analyses. If I can do anything besides cheer from the sidelines, just let me know. :-)

@don-jil
Copy link

don-jil commented Sep 20, 2022

Happy to help test and or document. Very much appreciate all the work going into this. :-)

@tomicapretto tomicapretto mentioned this pull request Oct 19, 2022
@donaldbraman
Copy link

So, just reviewing the various related issues, it looks like we need a way to let formulae know whether and how the data is left censored, right censored, or both. Is that the main barrier to implementation?

@tomicapretto
Copy link
Collaborator

@donaldbraman this is already implemented and it's possible to work with survival models in Bambi since #697 :)
I think I commented this in some issue but somehow it missed this.

You can have a look at this test

def test_censored_response():
data = bmb.load_data("kidney")
data["status"] = np.where(data["censored"] == 0, "none", "right")
# Model 1, with intercept
priors = {
"Intercept": bmb.Prior("Normal", mu=0, sigma=1),
"sex": bmb.Prior("Normal", mu=0, sigma=2),
"age": bmb.Prior("Normal", mu=0, sigma=1),
"alpha": bmb.Prior("Gamma", alpha=3, beta=5),
}
model = bmb.Model(
"censored(time, status) ~ 1 + sex + age", data, family="weibull", link="log", priors=priors
)
idata = model.fit(tune=100, draws=100, random_seed=121195)
model.predict(idata, kind="pps")
model.predict(idata, data=data, kind="pps")
# Model 2, without intercept
priors = {
"sex": bmb.Prior("Normal", mu=0, sigma=2),
"age": bmb.Prior("Normal", mu=0, sigma=1),
"alpha": bmb.Prior("Gamma", alpha=3, beta=5),
}
model = bmb.Model(
"censored(time, status) ~ 0 + sex + age", data, family="weibull", link="log", priors=priors
)
idata = model.fit(tune=100, draws=100, random_seed=121195)
model.predict(idata, kind="pps")
model.predict(idata, data=data, kind="pps")
# Model 3, with group-specific effects
priors = {
"alpha": bmb.Prior("Gamma", alpha=3, beta=5),
"sex": bmb.Prior("Normal", mu=0, sigma=1),
"age": bmb.Prior("Normal", mu=0, sigma=1),
"1|patient": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("InverseGamma", alpha=5, beta=10)),
}
model = bmb.Model(
"censored(time, status) ~ 1 + sex + age + (1|patient)",
data,
family="weibull",
link="log",
priors=priors,
)
idata = model.fit(tune=100, draws=100, random_seed=121195)
model.predict(idata, kind="pps")
model.predict(idata, data=data, kind="pps")

@donaldbraman
Copy link

Amazing -- you just made my day! And thank you again for all your work on this.

@donaldbraman this is already implemented and it's possible to work with survival models in Bambi since #697 :) I think I commented this in some issue but somehow it missed this.

You can have a look at this test

def test_censored_response():
data = bmb.load_data("kidney")
data["status"] = np.where(data["censored"] == 0, "none", "right")
# Model 1, with intercept
priors = {
"Intercept": bmb.Prior("Normal", mu=0, sigma=1),
"sex": bmb.Prior("Normal", mu=0, sigma=2),
"age": bmb.Prior("Normal", mu=0, sigma=1),
"alpha": bmb.Prior("Gamma", alpha=3, beta=5),
}
model = bmb.Model(
"censored(time, status) ~ 1 + sex + age", data, family="weibull", link="log", priors=priors
)
idata = model.fit(tune=100, draws=100, random_seed=121195)
model.predict(idata, kind="pps")
model.predict(idata, data=data, kind="pps")
# Model 2, without intercept
priors = {
"sex": bmb.Prior("Normal", mu=0, sigma=2),
"age": bmb.Prior("Normal", mu=0, sigma=1),
"alpha": bmb.Prior("Gamma", alpha=3, beta=5),
}
model = bmb.Model(
"censored(time, status) ~ 0 + sex + age", data, family="weibull", link="log", priors=priors
)
idata = model.fit(tune=100, draws=100, random_seed=121195)
model.predict(idata, kind="pps")
model.predict(idata, data=data, kind="pps")
# Model 3, with group-specific effects
priors = {
"alpha": bmb.Prior("Gamma", alpha=3, beta=5),
"sex": bmb.Prior("Normal", mu=0, sigma=1),
"age": bmb.Prior("Normal", mu=0, sigma=1),
"1|patient": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("InverseGamma", alpha=5, beta=10)),
}
model = bmb.Model(
"censored(time, status) ~ 1 + sex + age + (1|patient)",
data,
family="weibull",
link="log",
priors=priors,
)
idata = model.fit(tune=100, draws=100, random_seed=121195)
model.predict(idata, kind="pps")
model.predict(idata, data=data, kind="pps")

@tomicapretto
Copy link
Collaborator

I'm closing this PR because it's already implemented (see this previous comment #543 (comment))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants