[RFC] Loss Functions - Survival Model and Binomial Loss with count upper bound #4491

avinashbarnwal · 2019-05-23T20:17:42Z

Motivation

XGBoost supports a range of loss functions, from least squares to the Cox proportional hazards model. It already has rich support for linear regression, classification, count, and survival losses, but adding further survival and classification loss functions would improve the package's features and flexibility.

Goals

Support the Accelerated Failure Time (AFT) survival model for left-, right-, and interval-censored data.
Support count data with an upper bound via a binomial loss.

Non-Goals

This proposal is for XGBoost only. The interface proposed here is specific to R but can be generalized to other language bindings later.

Assumptions

Adding new attributes to the data matrix is allowed. This might change the properties of the Cox proportional hazards model, given that we might add support for interval-censored regression.

Risks
A new hyperparameter, sigma, must be tuned, which might slow down cross-validation.

Design

Existing Code

Survival Model - The Cox proportional hazards model is currently supported. Its labels are positive if the event happens and negative if the event doesn't happen (that is, the observation is censored). This is for right-censored data only.
Logistic regression - Logistic regression and logistic regression for binary classification are supported, with labels taking values in {0,1}.
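
The signed-label convention for the Cox model described above can be illustrated with a short sketch. It assumes that the magnitude of each label is the observed time and that only the sign carries the censoring status; the helper name `encode_cox_labels` is hypothetical and not part of XGBoost.

```python
def encode_cox_labels(times, event_observed):
    """Return signed labels: +t if the event was observed, -t if right-censored.

    Hypothetical helper for illustration only. Assumes every time is
    strictly positive so that the sign is free to carry the censoring flag.
    """
    labels = []
    for t, observed in zip(times, event_observed):
        if t <= 0:
            raise ValueError("times must be strictly positive")
        labels.append(t if observed else -t)
    return labels


# Three subjects: the second one was still event-free when observation ended.
labels = encode_cox_labels([5.0, 3.0, 8.5], [True, False, True])
```

Because the sign is the only place to store censoring information, this encoding cannot express left or interval censoring, which is why the AFT proposal needs a second label.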

Phase 1: Accelerated Failure Time (AFT)
Here we aim to support left-, right-, and interval-censored datasets.

New loss functions for Accelerated Failure Time with normal and logistic distributions.
Interval regression requires two labels as input. Left- and right-censored data can be represented as interval-censored data, where the lower limit is -inf for left-censored data and the upper limit is inf for right-censored data. We add a new attribute to the DMatrix object to store the extra label.
For a point event (uncensored label), the lower limit equals the upper limit.
A new parameter is created to pass the distribution type, either Normal or Logistic.
The gradient, hessian, and a performance metric will be added for the AFT loss functions.
The loss functions require an extra parameter, sigma. It will be treated as a hyperparameter.
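
To make the Phase 1 items concrete, here is a minimal sketch of a per-row AFT loss and its gradient with respect to the prediction. It is an illustration under stated assumptions, not the implementation that was merged: the two labels are taken to be bounds on the log of the event time (so -inf and inf are valid), the model is log T = pred + sigma * Z with Z standard normal or standard logistic, the uncensored case uses the density of log T (the 1/T Jacobian is dropped because it does not depend on pred), and the names `aft_loss` and `aft_gradient` are hypothetical.

```python
import math


def _pdf(z, dist):
    """Density of the standardized error term Z."""
    if dist == "normal":
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    if dist == "logistic":
        e = math.exp(-abs(z))  # density is symmetric; abs() avoids overflow
        return e / (1.0 + e) ** 2
    raise ValueError(f"unknown distribution: {dist}")


def _cdf(z, dist):
    """Cumulative distribution function of Z."""
    if dist == "normal":
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    if dist == "logistic":
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)
    raise ValueError(f"unknown distribution: {dist}")


def aft_loss(pred, log_lower, log_upper, sigma=1.0, dist="normal"):
    """Negative log-likelihood of one row with label interval [log_lower, log_upper]."""
    z_lo = (log_lower - pred) / sigma
    z_hi = (log_upper - pred) / sigma
    if log_lower == log_upper:  # uncensored: point event
        return -math.log(_pdf(z_lo, dist) / sigma)
    # censored: probability mass that the model puts on the interval
    return -math.log(_cdf(z_hi, dist) - _cdf(z_lo, dist))


def aft_gradient(pred, log_lower, log_upper, sigma=1.0, dist="normal"):
    """Derivative of aft_loss with respect to pred."""
    z_lo = (log_lower - pred) / sigma
    z_hi = (log_upper - pred) / sigma
    if log_lower == log_upper:
        if dist == "normal":
            return -z_lo / sigma
        if dist == "logistic":
            return -math.tanh(z_lo / 2.0) / sigma
        raise ValueError(f"unknown distribution: {dist}")
    num = _pdf(z_hi, dist) - _pdf(z_lo, dist)
    den = _cdf(z_hi, dist) - _cdf(z_lo, dist)
    return num / (sigma * den)


# Right-censored row: the event happened some time after t = 2.
loss = aft_loss(0.5, math.log(2.0), float("inf"), sigma=1.2, dist="logistic")
grad = aft_gradient(0.5, math.log(2.0), float("inf"), sigma=1.2, dist="logistic")
```

Left-, right-, and interval-censored rows all go through the same censored branch; only the infinite bound changes. The hessian follows by differentiating the gradient once more, and sigma enters only as a fixed scale, which is why it can be tuned as a hyperparameter.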

Phase 2: Binomial Loss for count data with an upper bound
Here we aim to add a new loss function for binomial loss.

The input label is a count, which is different from binary classification.
Each row has its own upper bound on the response count. A new attribute is added to the DMatrix object to store this upper bound.
The gradient, hessian, and a performance metric will be added for this loss function.

cc @hcho3 @tdhock

Note. This work is being carried out under the Google Summer of Code 2019. See discussion at #4242.

hcho3 · 2019-05-23T20:28:52Z

For those unfamiliar with survival analysis: this survey paper provides an excellent overview of survival analysis. Found this paper thanks to @avinashbarnwal

tdhock · 2019-05-23T21:26:08Z

you wrote that AFT model requires "extra sigma for normal distribution as a parameter to be estimated and beta for logistic distribution" but it would be better if you use the standard parameterization, which has sigma for both distributions, and allows easy extension to other distributions as well. this is described in section 6.9 of the survival package manual http://members.cbio.mines-paristech.fr/~thocking/survival.pdf

tdhock · 2019-05-23T21:29:35Z

also see avinashbarnwal#1 which is a summary of calls with @avinashbarnwal our GSOC'19 student

avinashbarnwal · 2019-05-23T21:41:18Z

I will change it to the standard parameterization, with sigma for both distributions. I was referring to your document https://github.com/tdhock/aft-poster/blob/master/HOCKING-AFT.pdf for the log-logistic case, where you mention beta as the parameter.

hcho3 · 2020-04-02T10:46:03Z

AFT is now part of XGBoost: #4763

trivialfis · 2020-04-19T23:39:41Z

@hcho3 Can we close this?

hcho3 changed the title ~~Loss Functions - Survival Model and Binomial Loss with count upper bound~~ [RFC] Loss Functions - Survival Model and Binomial Loss with count upper bound May 23, 2019

hcho3 self-assigned this May 23, 2019

hcho3 mentioned this issue May 23, 2019

Google Summer of Code: new loss functions in XGBoost #4242

Closed

hcho3 pinned this issue May 26, 2019

This was referenced Jul 3, 2019

Implementation of loss function depending on more than single output column #4556

Closed

[WIP] Add lower and upper bounds on the label for survival analysis #4650

Closed

[WIP] Add lower and upper bounds on the label for survival analysis #4651

Closed

hcho3 unpinned this issue Dec 20, 2019

hcho3 closed this as completed Jun 18, 2020
